# Text Classification with Traditional Models
The Twitter dataset (`tweets.csv`) was collected in February 2015. Contributors were asked to first classify each tweet as positive, negative, or neutral, and then to categorize the reason for each negative tweet (such as "late flight" or "rude service"). The dataset can be found [here](https://www.kaggle.com/crowdflower/twitter-airline-sentiment).

You should build an end-to-end NLP pipeline to predict the sentiment class (i.e., positive, negative, or neutral) given a tweet. In particular, you should do the following:
- Load the `tweets` dataset using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). You can find this dataset in the datasets folder.
- Split the dataset into training and test sets using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).
- Build an end-to-end NLP pipeline, including a text representation model, such as a [TF-IDF vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), and a traditional classification model, such as [Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html).
- Optimize your pipeline by cross-validating your design decisions. 
- Test the best pipeline on the test set and report various [evaluation metrics](https://scikit-learn.org/0.15/modules/model_evaluation.html).  
- Check the documentation to identify the most important hyperparameters, attributes, and methods of the model. Use them in practice.

In [28]:
import pandas as pd 
import nltk
import optuna
import numpy as np
import sklearn.model_selection
import sklearn.feature_extraction.text  # needed for TfidfVectorizer below
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import cross_val_score

In [4]:
df=pd.read_csv("/content/tweets.csv")
df.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


# **Data pre-processing**

**1- Lowercase the letters**

In [5]:
text=df["text"][10]
print('the original tweet')
print(text)
print('the lower case tweet')
print(text.lower())


the original tweet
@VirginAmerica did you know that suicide is the second leading cause of death among teens 10-24
the lower case tweet
@virginamerica did you know that suicide is the second leading cause of death among teens 10-24


**2- Remove symbols**

In [6]:
tokenizer=RegexpTokenizer(r'\w+|\$[\d\.]+|\S+@')
words=tokenizer.tokenize(text)
print(words)

['VirginAmerica', 'did', 'you', 'know', 'that', 'suicide', 'is', 'the', 'second', 'leading', 'cause', 'of', 'death', 'among', 'teens', '10', '24']


**3- Stemming**

In [7]:
porter_stemmer=PorterStemmer()
stemmed_words=[porter_stemmer.stem(w) for w in words]
print(stemmed_words)

['virginamerica', 'did', 'you', 'know', 'that', 'suicid', 'is', 'the', 'second', 'lead', 'caus', 'of', 'death', 'among', 'teen', '10', '24']


**4- Join the pre-processed words together**

In [8]:
preprocessed=" ".join(stemmed_words)
print(preprocessed)

virginamerica did you know that suicid is the second lead caus of death among teen 10 24


# **Apply all the pre-processing on all the data**

In [9]:
def preprocessor(text):
    # Convert to lowercase
    text = text.lower()
    # Tokenize, dropping punctuation and other symbols
    tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+@')
    words = tokenizer.tokenize(text)
    # Stem each token
    porter_stemmer = PorterStemmer()
    stemmed_words = [porter_stemmer.stem(w) for w in words]
    # Join the stemmed tokens back into one string
    return " ".join(stemmed_words)

# Apply the pre-processing to every tweet
df['reprocessed_data'] = df['text'].apply(preprocessor)
df.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone,reprocessed_data
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada),virginamerica what dhepburn said
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada),virginamerica plu you ve ad commerci to the ex...
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada),virginamerica i didn t today must mean i need ...
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada),virginamerica it s realli aggress to blast obn...
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada),virginamerica and it s a realli big bad thing ...


# **Split the data into Train and Test**

In [10]:
x = df["reprocessed_data"]
y = df["airline_sentiment"]
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y)
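
The split above uses the default settings, so it changes on every run and does not guarantee that the three sentiment classes keep their proportions in both sets. As a minimal sketch of one alternative, the snippet below passes `stratify` and `random_state` to `train_test_split`. The toy DataFrame, its column names, and the values `test_size=0.25` and `random_state=42` are assumptions made only for illustration; they are not taken from the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 12 negative, 4 neutral, 4 positive examples.
toy = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(20)],
    "label": ["negative"] * 12 + ["neutral"] * 4 + ["positive"] * 4,
})

# stratify keeps the class proportions the same in both sets;
# random_state makes the split reproducible across runs.
toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy["text"],
    toy["label"],
    test_size=0.25,
    stratify=toy["label"],
    random_state=42,
)

print(toy_y_train.value_counts().to_dict())
print(toy_y_test.value_counts().to_dict())
```

With imbalanced classes, as in this dataset, a stratified split keeps the minority classes represented in the test set in the same ratio as in the training set.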

# **Vectorization**

In [11]:
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(min_df=5)
vectorizer.fit(x_train)
x_train = vectorizer.transform(x_train).toarray()
x_test = vectorizer.transform(x_test).toarray()
print("new x train:", x_train.shape)
print("new x test:", x_test.shape)


new x train: (10980, 2174)
new x test: (3660, 2174)


# **MODEL TRAINING AND HYPERPARAMETER TUNING**

In [21]:
def objective(trial):
    # Define the search space for the hyperparameter
    # suggest_float(..., log=True) replaces the deprecated suggest_loguniform
    var_smoothing = trial.suggest_float('var_smoothing', 1e-10, 1e-6, log=True)
    
    # Create an instance of the Naive Bayes classifier with the suggested hyperparameter
    gnb = GaussianNB(var_smoothing=var_smoothing)
    
    # Perform cross-validation with accuracy as the scoring metric
    scores = cross_val_score(gnb, x_train, y_train, cv=5, scoring='accuracy')
    
    # Calculate the mean accuracy score
    mean_accuracy = scores.mean()
    
    return mean_accuracy

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

print('The best accuracy achieved: {:.2f}'.format(study.best_value * 100))
print('The best hyperparameters: {}'.format(study.best_params))


[I 2023-05-19 21:47:36,640] A new study created in memory with name: no-name-2827cb0a-d6f9-40a7-ae13-5afc4d92c859
[I 2023-05-19 21:47:39,426] Trial 0 finished with value: 0.4156648451730419 and parameters: {'var_smoothing': 1.911191312948378e-09}. Best is trial 0 with value: 0.4156648451730419.
[I 2023-05-19 21:47:42,177] Trial 1 finished with value: 0.4292349726775956 and parameters: {'var_smoothing': 8.515776486426551e-07}. Best is trial 1 with value: 0.4292349726775956.
[I 2023-05-19 21:47:46,448] Trial 2 finished with value: 0.41502732240437157 and parameters: {'var_smoothing': 7.859657281229854e-10}. Best is trial 1 with value: 0.4292349726775956.
...

The best accuracy achieved: 43.04
The best hyperparameters: {'var_smoothing': 9.956116231087637e-07}
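
The tuning above cross-validates only the classifier, on features from a vectorizer that was already fitted on the whole training set. A hedged sketch of another design follows: the vectorizer and the classifier are chained in a scikit-learn `Pipeline`, so TF-IDF is re-fitted inside every cross-validation fold, and `GridSearchCV` searches over both steps at once. It swaps in `MultinomialNB` (the model linked in the task description), which accepts the sparse TF-IDF matrix directly, in place of the notebook's `GaussianNB`. The toy corpus, the step names `tfidf` and `nb`, and the grid values are all assumptions for illustration; no claim is made about the scores this would reach on `tweets.csv`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy corpus (6 short texts per class), only to keep the sketch runnable.
texts = [
    "flight delayed again terrible service",
    "lost my bag awful experience",
    "rude staff and late flight",
    "worst airline cancelled flight",
    "terrible delay and rude service",
    "awful late cancelled again",
    "what time does the flight leave",
    "where is the gate for boarding",
    "how do i change my seat",
    "is there wifi on the flight",
    "what is the baggage allowance",
    "when does boarding start",
    "great flight thank you",
    "love the friendly crew",
    "amazing service thanks",
    "thank you for the great crew",
    "friendly staff love it",
    "amazing flight great service",
]
labels = ["negative"] * 6 + ["neutral"] * 6 + ["positive"] * 6

# Chain text representation and classifier into a single estimator.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

# Illustrative grid: "<step name>__<parameter>" addresses a step's parameter.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "nb__alpha": [0.1, 0.5, 1.0],
}

# For classifiers, cv=3 uses stratified 3-fold cross-validation.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print("Best CV accuracy:", search.best_score_)
print("Best parameters:", search.best_params_)
print("Prediction:", search.predict(["my flight was late again"]))
```

On the real data, the same pattern would take the pre-processed training text and labels in place of `texts` and `labels`, and the fitted `search` object could then be scored on the held-out test set.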


# **MODEL ASSESSMENT**

In [29]:
best_params = study.best_params

gnb = GaussianNB(var_smoothing=best_params['var_smoothing'])
gnb.fit(x_train, y_train)

y_predicted = gnb.predict(x_test)

accuracy = accuracy_score(y_test, y_predicted)
precision, recall, f1, support = precision_recall_fscore_support(y_test, y_predicted)
cm = confusion_matrix(y_test, y_predicted)

print('Accuracy: {:.2f}%'.format(accuracy * 100))
print('Precision:', precision)
print('Recall:', recall)
print('F1-score:', f1)
print('Confusion matrix:')
print(cm)

Accuracy: 40.03%
Precision: [0.86368843 0.27114967 0.24718499]
Recall: [0.33407178 0.31605563 0.75326797]
F1-score: [0.48178914 0.29188558 0.37222447]
Confusion matrix:
[[754 569 934]
 [ 71 250 470]
 [ 48 103 461]]
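
To make the reported numbers easier to read, the sketch below recomputes them from the confusion matrix printed above. It assumes the rows and columns follow scikit-learn's default sorted label order (negative, neutral, positive), which is consistent with the first row being the largest class. It also computes a majority-class baseline, a comparison that is not part of the original notebook.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows = true class, columns = predicted class).
cm = np.array([
    [754, 569, 934],
    [71, 250, 470],
    [48, 103, 461],
])
# Assumed label order: scikit-learn sorts string labels alphabetically.
class_names = ["negative", "neutral", "positive"]

true_totals = cm.sum(axis=1)   # number of test tweets per true class
pred_totals = cm.sum(axis=0)   # number of test tweets per predicted class
correct = np.diag(cm)          # correctly classified tweets per class

recall = correct / true_totals
precision = correct / pred_totals
accuracy = correct.sum() / cm.sum()

# Accuracy of always predicting the most frequent true class.
majority_baseline = true_totals.max() / cm.sum()

for name, p, r in zip(class_names, precision, recall):
    print(f"{name}: precision={p:.4f}, recall={r:.4f}")
print(f"accuracy={accuracy:.4f}, majority baseline={majority_baseline:.4f}")
```

Under that label-order assumption, 2257 of the 3660 test tweets are negative, so always predicting "negative" would score about 61.7% accuracy, well above the 40.03% reported here.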
