### Task

1. Import the Twitter dataset of tweets into a DataFrame.
2. Keep only the positive and negative tweets (exclude the neutral ones). What is the percentage of positive/negative tweets?
3. Create a clean function to get rid of redundant word forms and punctuation.
4. Copy the clean column into a Series X, and the sentiment column into a Series y. Apply a train-test split with the training set size at 0.75 and random_state = 32.
5. Apply a CountVectorizer and train classification models.
6. Apply a TfidfVectorizer and train classification models.
7. Compare the scores. Which parameters give the best scores?

Bonus: now it's your turn to improve your model:

8. By searching for better model parameters, for example with grid search and cross-validation.
9. By changing the preparation of the text: some punctuation can help the model, the exclamation mark in particular.

### Imports

In [None]:
import pandas as pd
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [None]:
nltk.download('popular')

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cmudict.zip.
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gazetteers.zip.
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/genesis.zip.
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gutenberg.zip.
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/inaugural.zip.
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/movie_reviews.zip.
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/names.zip.
[nltk_data]    | Downloading package shakespeare to /root/nltk_data...

True

### Import data

In [None]:
url = 'https://raw.githubusercontent.com/DaPlayfulQueen/DE_track_data/master/tweets.csv'
tweets = pd.read_csv(url)
print(f'Original tweet count is {tweets.shape[0]}')
tweets.head()

Original tweet count is 27480


Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative


### Remove neutrals

In [None]:
tweets = tweets[tweets.sentiment != 'neutral']
print(f'Neg/pos tweet count is {tweets.shape[0]}')

Neg/pos tweet count is 16363


In [None]:
tweets.sentiment.value_counts()

positive    8582
negative    7781
Name: sentiment, dtype: int64

In [None]:
counts = tweets.sentiment.value_counts()
pos_percentage = round(counts['positive'] / (counts['negative'] + counts['positive']) * 100, 2)
neg_percentage = round(counts['negative'] / (counts['negative'] + counts['positive']) * 100, 2)
print(f'The positive percentage is {pos_percentage}%, negative is {neg_percentage}%')

The positive percentage is 52.45%, negative is 47.55%
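
The same percentages can also be read off with `value_counts(normalize=True)`. A minimal sketch, where the toy Series simply rebuilds the class counts shown above instead of loading the dataset (the variable names are mine):

```python
import pandas as pd

# Toy stand-in for tweets.sentiment, rebuilt from the counts printed above
sentiment = pd.Series(['positive'] * 8582 + ['negative'] * 7781)

# normalize=True returns class fractions that sum to 1
shares = (sentiment.value_counts(normalize=True) * 100).round(2)
print(shares)
```

This avoids repeating the denominator for each class and extends to more than two labels without changes.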


### Clean text


In [None]:
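import spacy
import string

# spaCy pipeline used for lemmatization (requires the en_core_web_sm model)
lemmatizer = spacy.load('en_core_web_sm')
# A set keeps the per-token stopword lookup fast
stopwords = set(nltk.corpus.stopwords.words("english"))

def clean(text: str) -> str:
  """Lowercase, lemmatize, then drop stopwords and tokens found in string.punctuation."""
  text = text.lower()
  tokens = [token.lemma_ for token in lemmatizer(text)]
  # `in string.punctuation` is a substring test, so tokens such as '...' are kept
  tokens = [token for token in tokens if token not in stopwords and token not in string.punctuation]
  return ' '.join(tokens)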
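
A note on the punctuation filter: `token not in string.punctuation` checks whether the token is a substring of the punctuation string, which is why a token like `...` survives in the output below. A small sketch of that behaviour; `is_punctuation` is a hypothetical helper of mine and is not used anywhere in this notebook:

```python
import string

# `in` on a str is a substring test, not a set-membership test
print('!' in string.punctuation)    # True
print('...' in string.punctuation)  # False
print('!!' in string.punctuation)   # False

def is_punctuation(token: str) -> bool:
    """Hypothetical helper: True if the token is non-empty and every character is punctuation."""
    return bool(token) and all(ch in string.punctuation for ch in token)

print(is_punctuation('...'))  # True
```

The scores reported in this notebook were produced with the substring-based check.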

In [None]:
tweets['lemmatized_text'] = tweets.text.apply(clean)

In [None]:
tweets.head(5)

Unnamed: 0,textID,text,selected_text,sentiment,lemmatized_text
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative,sooo sad I miss san diego
2,088c60f138,my boss is bullying me...,bullying me,negative,boss bully I ...
3,9642c003ef,what interview! leave me alone,leave me alone,negative,interview leave I alone
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative,son couldn`t put release already buy
6,6e0c6d75b1,2am feedings for the baby are fun when he is a...,fun,positive,2 feeding baby fun smile coo


### Vectorize with CountVectorizer

In [None]:
X = tweets['lemmatized_text']
y = tweets['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=32,
                                                    train_size=0.75)

In [None]:
count_vectorizer = CountVectorizer()
# Fit the vocabulary on the training set only, then reuse it for the test set
X_train_count = count_vectorizer.fit_transform(X_train)
X_test_count = count_vectorizer.transform(X_test)
X_test_count

<4091x13521 sparse matrix of type '<class 'numpy.int64'>'
	with 27458 stored elements in Compressed Sparse Row format>

### Vectorize with Tfidf

In [None]:
tfidf_vectorizer = TfidfVectorizer()
# Fit the vocabulary and idf weights on the training set only
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
X_test_tfidf

<4091x13521 sparse matrix of type '<class 'numpy.float64'>'
	with 27458 stored elements in Compressed Sparse Row format>
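
Both matrices have the same shape and the same number of stored elements; only the values differ. To make the difference concrete, here is a hand-rolled sketch of the weighting that `TfidfVectorizer` applies with its default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`): raw count times `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The toy corpus and the function name are assumptions of mine, not part of the exercise:

```python
import math
from collections import Counter

# Toy corpus (not taken from the tweets dataset)
docs = ['good day', 'bad day', 'good good mood']
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Smoothed idf times raw term count, then L2-normalized."""
    counts = Counter(tokens)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vectors = [tfidf(tokens) for tokens in tokenized]
print(vectors[1])  # 'bad' (one document) outweighs 'day' (two documents)
```

Terms that appear in fewer documents get a larger weight, which is the only thing that distinguishes the tf-idf features from the raw counts here.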

### Logistic regression

In [None]:
def train_lr_and_get_score(X_train, X_test, y_train, y_test, lr_params=None):
  # Avoid a mutable default argument; None means "no extra parameters"
  logistic_reg = LogisticRegression(max_iter=1000, **(lr_params or {}))
  logistic_reg.fit(X_train, y_train)

  y_train_predict = logistic_reg.predict(X_train)
  train_accuracy_score = accuracy_score(y_train, y_train_predict)

  y_test_predict = logistic_reg.predict(X_test)
  test_accuracy_score = accuracy_score(y_test, y_test_predict)

  return train_accuracy_score, test_accuracy_score

#### Count-vectorized data

In [None]:
train_accuracy_score_count, test_accuracy_score_count = train_lr_and_get_score(X_train_count, X_test_count, y_train, y_test)

#### Tfidf-vectorized data

In [None]:
train_accuracy_score_tfidf, test_accuracy_score_tfidf = train_lr_and_get_score(X_train_tfidf, X_test_tfidf, y_train, y_test)

### Models comparison

In [None]:
def display_score_comparison(train_count_score, test_count_score, train_tfidf_score, test_tfidf_score):
  display(pd.DataFrame({
    'Count accuracy score on train': [train_count_score],
    'Count accuracy score on test': [test_count_score],
    'Tfidf accuracy score on train': [train_tfidf_score],
    'Tfidf accuracy score on test': [test_tfidf_score],
  }))

In [None]:
display_score_comparison(train_accuracy_score_count, test_accuracy_score_count, train_accuracy_score_tfidf, test_accuracy_score_tfidf)

Unnamed: 0,Count accuracy score on train,Count accuracy score on test,Tfidf accuracy score on train,Tfidf accuracy score on test
0,0.954123,0.869225,0.928048,0.871425


### Applying GridSearch for better parameters

#### Getting best params

In [None]:
def get_best_parameters(X_train, y_train):
  param_grid = {
      'penalty': ['l1', 'l2'],
      'C': [0.001, 0.01, 0.1, 1, 10, 100],
      'solver': ['liblinear', 'saga'],
      'fit_intercept': [True, False],
      'class_weight': [None, 'balanced'],
  }

  grid_search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=2, scoring='accuracy', verbose=1)
  # Use the function argument so count and tf-idf features are each tuned on their own matrix
  grid_search.fit(X_train, y_train)
  return grid_search.best_params_

In [None]:
count_best_params = get_best_parameters(X_train_count, y_train)

Fitting 2 folds for each of 96 candidates, totalling 192 fits




In [None]:
tfidf_best_params = get_best_parameters(X_train_tfidf, y_train)

Fitting 2 folds for each of 96 candidates, totalling 192 fits




In [None]:
train_accuracy_score_count2, test_accuracy_score_count2 = train_lr_and_get_score(X_train_count, X_test_count, y_train, y_test, count_best_params)
train_accuracy_score_tfidf2, test_accuracy_score_tfidf2 = train_lr_and_get_score(X_train_tfidf, X_test_tfidf, y_train, y_test, tfidf_best_params)
display_score_comparison(train_accuracy_score_count2, test_accuracy_score_count2, train_accuracy_score_tfidf2, test_accuracy_score_tfidf2)

Unnamed: 0,Count accuracy score on train,Count accuracy score on test,Tfidf accuracy score on train,Tfidf accuracy score on test
0,0.911587,0.865314,0.881356,0.864092
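
Another way to run this search would be to put the vectorizer and the classifier into one `Pipeline`, so that each cross-validation fold fits its vocabulary on its own training part and vectorizer settings can be tuned together with `C`. This is only a sketch under my own assumptions: the toy texts stand in for `X_train` / `y_train`, and the grid is deliberately tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-ins for X_train / y_train (not from the tweets dataset)
texts = ['love it', 'so happy today', 'great fun', 'happy happy joy',
         'hate it', 'so sad today', 'awful day', 'sad sad mess']
labels = ['positive'] * 4 + ['negative'] * 4

pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# Step name + double underscore addresses a parameter of that step
param_grid = {
    'vectorizer__ngram_range': [(1, 1), (1, 2)],
    'classifier__C': [0.1, 1, 10],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring='accuracy')
search.fit(texts, labels)
print(search.best_params_)
```

I have not run this on the full dataset, so I cannot say whether it would beat the default parameters.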


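### Restore some punctuation

I am going to add question marks and exclamation marks back. Note that the function below simply skips the punctuation filter, so it keeps all punctuation tokens, not only `?` and `!`.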

In [None]:
def clean_extra_punctuation(text: str) -> str:
  text = text.lower()
  tokens = [token.lemma_ for token in lemmatizer(text)]
  tokens = [token for token in tokens if token not in stopwords]
  return ' '.join(tokens)

In [None]:
tweets['exp_text'] = tweets.text.apply(clean_extra_punctuation)

In [None]:
tweets.tail(10)

Unnamed: 0,textID,text,selected_text,sentiment,lemmatized_text,exp_text
27464,c14a543497,Sure. I`ll try n keep that up! =P You enjoy s...,enjoy,positive,sure i`ll try n keep p enjoy study cya,sure . i`ll try n keep ! = p enjoy study . c...
27466,432e6de6c9,morning twit-friends! welcome to my new followers,welcome,positive,morning twit friend welcome new follower,morning twit - friend ! welcome new follower
27469,778184dff1,lol i know and haha..did you fall asleep?? o...,t bored,negative,lol I know haha .. fall asleep get bore sh...,lol I know haha .. fall asleep ? ? get bor...
27471,8f5adc47ec,http://twitpic.com/663vr - Wanted to visit the...,were too late,negative,http://twitpic.com/663vr want visit animal late,http://twitpic.com/663vr - want visit animal late
27473,8f14bb2715,So I get up early and I feel good about the da...,I feel good ab,positive,I get early I feel good day I walk work i`m fe...,I get early I feel good day . I walk work i`m ...
27474,b78ec00df5,enjoy ur night,enjoy,positive,enjoy ur night,enjoy ur night
27475,4eac33d1c0,wish we could come see u on Denver husband l...,d lost,negative,wish could come see u denver husband lose ...,wish could come see u denver husband lose ...
27476,4f4c4fc327,I`ve wondered about rake to. The client has ...,", don`t force",negative,i`ve wonder rake client make clear .net do...,i`ve wonder rake . client make clear .net ...
27477,f67aae2310,Yay good for both of you. Enjoy the break - y...,Yay good for both of you.,positive,yay good enjoy break probably need hectic we...,yay good . enjoy break - probably need hecti...
27478,ed167662a5,But it was worth it ****.,But it was worth it ****.,positive,worth,worth * * * * .


In [None]:
X = tweets['exp_text']
y = tweets['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=32,
                                                    train_size=0.75)

# Fit each vectorizer on the training set only, then reuse it for the test set
count_vectorizer = CountVectorizer()
X_train_count3 = count_vectorizer.fit_transform(X_train)
X_test_count3 = count_vectorizer.transform(X_test)

tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf3 = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf3 = tfidf_vectorizer.transform(X_test)

In [None]:
train_accuracy_score_count3, test_accuracy_score_count3 = train_lr_and_get_score(X_train_count3, X_test_count3, y_train, y_test)
train_accuracy_score_tfidf3, test_accuracy_score_tfidf3 = train_lr_and_get_score(X_train_tfidf3, X_test_tfidf3, y_train, y_test)
display_score_comparison(train_accuracy_score_count3, test_accuracy_score_count3, train_accuracy_score_tfidf3, test_accuracy_score_tfidf3)

Unnamed: 0,Count accuracy score on train,Count accuracy score on test,Tfidf accuracy score on train,Tfidf accuracy score on test
0,0.954123,0.869225,0.928048,0.871425


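This looked very odd at first. That my "best" parameters made the scores slightly worse is not so surprising: I cannot run GridSearch for long, so the search was small (2-fold cross-validation), and the results are rather underwhelming. But keeping the punctuation did not change the accuracy scores at all. The most likely explanation is that `CountVectorizer` and `TfidfVectorizer` both use the default `token_pattern` `(?u)\b\w\w+\b`, which treats punctuation as a separator and keeps only tokens of two or more word characters. The `!` and `?` tokens I restored are therefore dropped again before they reach the model, and with the same `random_state` the split is identical too.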
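
Whether the vectorizer even sees the restored punctuation can be checked directly. A minimal sketch using one tweet from the table above in both cleaned forms; the custom `token_pattern` is my assumption of one way to keep `!` and `?` as features, not something this notebook ran:

```python
from sklearn.feature_extraction.text import CountVectorizer

# The same tweet after clean_extra_punctuation and after clean
with_punct = ['morning twit - friend ! welcome new follower']
without_punct = ['morning twit friend welcome new follower']

# Default token_pattern: only runs of 2+ word characters become tokens
default_a = CountVectorizer().fit(with_punct)
default_b = CountVectorizer().fit(without_punct)
print(default_a.vocabulary_ == default_b.vocabulary_)  # True

# Assumed alternative: words of 2+ characters, or a single ! or ?
custom = CountVectorizer(token_pattern=r'(?u)\b\w\w+\b|[!?]')
custom.fit(with_punct)
print('!' in custom.vocabulary_)  # True
```

Retraining with such a pattern would be the real test of whether exclamation marks help the model.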