# US election Twitter mini Datathon (Advanced)
_Classify major events based on tweets sent by users on Twitter._

**Team name:** Data Guys

**Team members:** Akash Gupta, Ashwin Agarwal, Kevin Biju Mathew, Shruti Mahajan


## 1. Introduction <a class="anchor" id="one"></a>

In this datathon, we perform fine-grained classification of tweets into four categories: BLM, Trump, Biden, and COVID.

The training data is provided in three formats: `.csv`, `.xlsx`, and `.json`. We need to parse all three files to assemble the complete training data. The testing data is available in `.csv` format.

We train a supervised machine learning model on the training data and then use it to classify the test tweets into the given categories.

Overall, we perform the following tasks:

1. Read all the train and test data into their respective dataframes.
2. Preprocess the tweets by:
   - Decoding (as `utf-8`) tweets that are stored as encoded bytes.
   - Removing non-English tweets from the training data.
   - Tokenizing and stemming the tweets in the training and testing data.
   - Vectorizing the tweets using `TfidfVectorizer`.
3. Finally, we fit a support vector machine on the TF-IDF vectors of the training tweets, and then use it to classify the tweets in the testing data.

More details for each task will be given in the following sections.

## Table of Contents

* [1. Introduction](#one)
* [2. Import libraries](#two)
* [3. Read and store train and test data](#three)
* [4. Prepare data for machine learning](#four)
* [5. Create the machine learning model](#five)
* [6. Classify each tweet in the test data](#six)

## 2. Import libraries <a class="anchor" id="two"></a>

In this step, we import all the libraries required for this task.

In [1]:
# uncomment to install langid package if not available
# !pip -q install langid 

# Importing the required libraries
import pandas as pd   # import pandas for data frame
from nltk.tokenize import RegexpTokenizer   # used to tokenize the tweets
from nltk.stem import PorterStemmer   # used to stem the tokens
from langid import classify   # used to classify the language of the tweets
import ast, re    # used for literal evaluation and regular expressions
from sklearn import svm   # support vector machine model for ML
from sklearn.feature_extraction.text import TfidfVectorizer   # used for TF-IDF vectorization

We have successfully imported all the libraries required for our task.

## 3. Read and store train and test data <a class="anchor" id="three"></a>

Here we read all the files provided to us. The training data is read from the `.csv`, `.xlsx`, and `.json` files, while the test data is read from a `.csv` file.

In [2]:
# Concatenate all the data, drop duplicates, drop rows that contain NA, and reset the index
# All the training data is read and stored in the df_train dataframe
df_train = pd.concat([pd.read_csv("train_csv.csv"), 
                    pd.read_excel("train_excel.xlsx"), 
                    pd.read_json("train_json.json")])\
                    .drop_duplicates()\
                    .dropna(axis = 0, how = 'any')\
                    .reset_index(drop = True)

# Loading the test data
df_test = pd.read_csv("test_data_advanced.csv")

The train and test data are now loaded into `df_train` and `df_test` respectively.
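To illustrate what the concatenate, de-duplicate, and drop-NA chain does, here is a minimal sketch on in-memory data. The column names `tweet` and `label` and the sample rows are assumptions made for illustration only, and the Excel source is left out so that the sketch needs nothing beyond pandas.

```python
import io
import pandas as pd

# Hypothetical stand-ins for the training files; the column names are assumed.
csv_source = io.StringIO("tweet,label\nvote early,Biden\nwear a mask,COVID\n")
json_source = io.StringIO(
    '[{"tweet": "wear a mask", "label": "COVID"},'
    ' {"tweet": "rally tonight", "label": null}]'
)

frames = [
    pd.read_csv(csv_source),
    pd.read_json(json_source, orient="records"),
]

combined = (
    pd.concat(frames, ignore_index=True)
    .drop_duplicates()          # "wear a mask" appears in both sources
    .dropna(axis=0, how="any")  # "rally tonight" has no label
    .reset_index(drop=True)     # renumber the surviving rows from 0
)
print(combined)
```

Only the rows that are unique and fully populated remain, which is the property the later steps rely on.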

## 4. Prepare data for machine learning <a class="anchor" id="four"></a>

We must preprocess the data provided to us before performing any machine learning on it. In this step, we prepare our data for machine learning.

Many tweets in the train data set are stored as the text of a bytes literal (they start with `b'` or `b"`), which indicates that they were encoded before being written to the file. As our first step, we decode these tweets.

In [3]:
def checkEncoding(tweet):
    '''
    In this function we use regex to check whether a tweet is encoded.
    If it is, we decode it.
    param - 
    tweet: A single tweet (string) from the dataframe.
    return - 
    tweet: The decoded version of the tweet (if encoded). Else the original tweet.
    '''
    # Find tweets that start with b' or b"; non-string values are returned unchanged
    if isinstance(tweet, str) and re.match(r"^b[\'\"]", tweet):
        # Using literal_eval to get the data in bytes so that it can be decoded
        tweet = ast.literal_eval(tweet).decode('utf-8')
    return tweet

# Apply the checkEncoding function to each tweet of the training data
df_train['tweet'] = df_train['tweet'].apply(checkEncoding)

# Apply the checkEncoding function to each tweet of the test data
df_test['tweet'] = df_test['tweet'].apply(checkEncoding)

We have successfully decoded all the tweets in both the train and test data.
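As a small worked example of the decoding step, the sketch below runs the same two calls (`ast.literal_eval` followed by `.decode('utf-8')`) on a single made-up string. The sample text is hypothetical and not taken from the data set.

```python
import ast

# A hypothetical tweet as it might appear in the raw file: the text is the
# printed form of a bytes object, with UTF-8 bytes shown as \x escapes.
raw = "b'caf\\xc3\\xa9 is open'"

as_bytes = ast.literal_eval(raw)    # safely parse the literal into real bytes
decoded = as_bytes.decode("utf-8")  # turn the UTF-8 bytes into text

print(type(as_bytes).__name__)
print(decoded)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for text that comes from an external file.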

Next, we remove the non-English tweets from our train data.

In [4]:
# Classify the language of each tweet and keep only the English ones
df_en = df_train[df_train['tweet'].apply(lambda t: classify(t)[0] == 'en')].copy()

The `df_en` dataframe now contains only the training tweets that `langid` classifies as English.

Next, we tokenize and stem all the tweets in the train and test data.

In [5]:
# Create the tokenizer and stemmer once and reuse them for every tweet
tokenizer = RegexpTokenizer(r'\w+')
stemmer = PorterStemmer()

def processTweet(text):
    '''
    In this function we tokenize the tweet i.e. separate each word in the tweet,
    and then stem each token.
    params -
    text: Tweet that needs to be tokenized
    return -
    processed_text: List of tokens after tokenizing and stemming
    '''
    # we use the RegExp tokenizer to split the tweet into runs of word characters
    processed_text = tokenizer.tokenize(text)

    # we use PorterStemmer to stem each token
    processed_text = [stemmer.stem(word) for word in processed_text]

    return processed_text

# Apply the processTweet function to every tweet in the train data
df_en['tokens'] = df_en['tweet'].apply(processTweet)

# Apply the processTweet function to every tweet in the test data
df_test['tokens'] = df_test['tweet'].apply(processTweet)



The tweets in both the train and test data are now tokenized and stemmed, and are ready to be vectorized.
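To show what the `\w+` pattern keeps and discards, here is a minimal sketch that applies the same pattern with the standard library's `re.findall`. It is a stand-in for the tokenizing step only (no stemming), and the example tweet is made up.

```python
import re

# Hypothetical example tweet, not taken from the data set.
tweet = "Can't wait to #vote! @friend see https://t.co/abc123"

# re.findall with \w+ returns every run of letters, digits, or underscores,
# so punctuation such as ', #, @, :, / and . acts as a separator.
tokens = re.findall(r"\w+", tweet)
print(tokens)
```

Note that hashtags and mentions lose their leading symbol, contractions are split at the apostrophe, and URLs break into several short fragments, all of which end up as separate tokens.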

## 5. Create the machine learning model <a class="anchor" id="five"></a>

In this step, we build the TF-IDF vectors that the machine learning model is trained on. The classifier itself is fitted in the next section.

In [6]:
# Get the train and test tokens and the training labels
X_train, X_test, y_train = df_en.tokens, df_test.tokens, df_en.label

# Create the TF-IDF vectorizer
tfidf = TfidfVectorizer(analyzer = "word", ngram_range = (1, 2), 
                        stop_words = 'english', use_idf = True, max_df = 0.95,
                        min_df = 0.001, sublinear_tf = True)

# Fit the vectorizer on the training data and transform it
X_tr_tfidf = tfidf.fit_transform([' '.join(value) for value in X_train])

# Transform the test data using the vocabulary learned from the training data
X_test_tfidf = tfidf.transform([' '.join(value) for value in X_test])

We have fitted the TF-IDF vectorizer on the training data and used it to vectorize both the train and test tweets.
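The difference between `fit_transform` on the training data and `transform` on the test data can be seen in a minimal sketch on toy documents. The documents and the name `toy_tfidf` are hypothetical, and the vectorizer here uses fewer options than the one above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents standing in for the joined, stemmed tweets.
train_docs = ["mask mandat covid", "covid vaccin trial", "biden ralli tonight"]
test_docs = ["covid mask", "unseen word ralli"]

toy_tfidf = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)

# fit_transform learns the vocabulary and IDF weights from the training docs
train_matrix = toy_tfidf.fit_transform(train_docs)
# transform reuses that vocabulary; terms not seen in training are ignored
test_matrix = toy_tfidf.transform(test_docs)

print(train_matrix.shape, test_matrix.shape)
print("unseen" in toy_tfidf.vocabulary_)
```

Both matrices have the same number of columns because the test data is mapped onto the vocabulary learned from the training data, which is what allows the classifier to be applied to it.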

In the next step, we train the classifier and make predictions.

## 6. Classify each tweet in the test data <a class="anchor" id="six"></a>

In this step, we make predictions and classify the tweets in the test data into the given categories.

In [7]:
# Create the SVC model with a linear kernel
# Note: degree and gamma have no effect when kernel = 'linear'
model = svm.SVC(C = 1.0, kernel = 'linear', degree = 3, gamma = 'auto')

# Fitting the SVC model to the training data
model.fit(X_tr_tfidf, y_train)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [8]:
# Make predictions on the test data
df_test['label'] = list(model.predict(X_test_tfidf))

# Save the predictions to the prediction.csv file
df_test[['Train_id', 'label']].to_csv('prediction.csv', index = False)

The predictions are saved to the `prediction.csv` file.
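As a possible alternative layout, the vectorizing and classification steps can be chained into a single scikit-learn pipeline so that the same transformation is always applied before prediction. This is only a sketch on hypothetical toy data, not the configuration used to produce `prediction.csv`; the names `toy_texts`, `toy_labels`, and `toy_pipeline` are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy data; the notebook itself uses the stemmed tweets and labels.
toy_texts = [
    "mask mandat covid", "covid vaccin trial",
    "biden ralli tonight", "biden campaign speech",
]
toy_labels = ["COVID", "COVID", "Biden", "Biden"]

# Chain vectorization and classification so both are fitted in one call.
toy_pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    SVC(C=1.0, kernel="linear"),
)
toy_pipeline.fit(toy_texts, toy_labels)

# predict vectorizes the new texts with the fitted vocabulary, then classifies
toy_predictions = toy_pipeline.predict(["covid mask rule", "biden speech"])
print(list(toy_predictions))
```

A pipeline like this also makes it straightforward to hold out part of the labelled data for validation, since the vectorizer is then refitted on each training split only.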