### Task


1. Import this dataset of tweets into a DataFrame.
2. Keep only positive and negative tweets (so you exclude neutral). What is the percentage of positive/negative tweets?
3. Copy the text column into a Series X, and the sentiment column into a Series y. Apply a train test split with the random_state = 32 and train_size equal to 0.75.
4. Create a vectorizer model with scikit-learn using the CountVectorizer class. Train your model on X_train, then create a matrix of features X_train_CV. Create the X_test_CV matrix without re-training the model. The shape of the X_test_CV matrix should be 4091x15806 with 44633 stored elements.
5. Now train a logistic regression with default parameters. You should get these scores: 0.966 for the train set, and 0.877 for the test set.
6. Bonus step: try to display 10 tweets that were badly predicted (false positive or false negative). Would you have done better than the algorithm?



### Imports

In [None]:
import pandas as pd
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [None]:
nltk.download('popular')

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cmudict.zip.
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gazetteers.zip.
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/genesis.zip.
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gutenberg.zip.
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/inaugural.zip.
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/movie_reviews.zip.
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/names.zip.
[nltk_data]    | Downloading package shakespeare to /root/nltk_data...
[nlt

True

### Import data

In [None]:
url = 'https://raw.githubusercontent.com/DaPlayfulQueen/DE_track_data/master/tweets.csv'
tweets = pd.read_csv(url)
print(f'Original tweet count is {tweets.shape[0]}')
tweets.head()

Original tweet count is 27480


Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative


### Remove neutrals

In [None]:
tweets = tweets[tweets.sentiment != 'neutral']
print(f'Neg/pos tweet count is {tweets.shape[0]}')

Neg/pos tweet count is 16363


In [None]:
counts = tweets.sentiment.value_counts()
pos_percentage = round(counts['positive'] / (counts['negative'] + counts['positive']) * 100, 2)
neg_percentage = round(counts['negative'] / (counts['negative'] + counts['positive']) * 100, 2)
print(f'The positive percentage is {pos_percentage}%, negative is {neg_percentage}%')

The positive percentage is 52.45%, negative is 47.55%
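
The same class shares can probably be obtained more directly by letting pandas normalize the counts. The sketch below runs on a tiny made-up frame (`toy` is a hypothetical stand-in for `tweets`), so the numbers it prints are not the dataset's.

```python
import pandas as pd

# Toy stand-in for the filtered tweets DataFrame (hypothetical data).
toy = pd.DataFrame({'sentiment': ['positive', 'negative', 'positive', 'positive']})

# normalize=True returns the share of each class instead of the raw count.
shares = toy['sentiment'].value_counts(normalize=True).mul(100).round(2)
print(shares.to_dict())
```

This avoids writing the denominator by hand, which is where the bracket mistakes tend to creep in.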


### Vectorize

In [None]:
X = tweets['text']
y = tweets['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=32,
                                                    train_size=0.75)

In [None]:
from nltk.corpus import stopwords
# vectorizer = CountVectorizer(lowercase=True,
#                              stop_words=stopwords.words('english'),
#                              ngram_range=(1, 2))


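vectorizer = CountVectorizer()
# Fit the vocabulary on the training texts only and build the train matrix.
X_train_CV = vectorizer.fit_transform(X_train)
# Reuse the fitted vocabulary for the test texts (no re-fitting).
X_test_CV = vectorizer.transform(X_test)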
X_test_CV

<4091x15806 sparse matrix of type '<class 'numpy.int64'>'
	with 44633 stored elements in Compressed Sparse Row format>
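
To see why the test matrix has exactly as many columns as the training vocabulary, here is a minimal sketch on a made-up corpus (`train_docs`, `test_docs` and `toy_vectorizer` are illustrative names, not part of the exercise). Words that appear only in the test texts are simply ignored by `transform`.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ['good day', 'bad day']   # hypothetical training texts
test_docs = ['good good night']        # 'night' never appears in training

toy_vectorizer = CountVectorizer()
toy_vectorizer.fit(train_docs)         # vocabulary comes from the training texts only

vocab = sorted(toy_vectorizer.vocabulary_)         # terms in column order
test_matrix = toy_vectorizer.transform(test_docs)  # one row, one column per known term

print(vocab)
print(test_matrix.toarray())   # 'good' is counted twice, 'night' is dropped
```

The single stored element in `test_matrix` is the count for 'good'; the sparse format keeps only non-zero counts, which is what "stored elements" refers to in the output above.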

### Logistic regression

In [None]:
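# Default parameters except max_iter, raised from the default of 100 to give the solver room to converge.
logistic_reg = LogisticRegression(max_iter=1000)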
logistic_reg.fit(X_train_CV, y_train)

y_train_predict = logistic_reg.predict(X_train_CV)
train_accuracy_score = accuracy_score(y_train, y_train_predict)

y_test_predict = logistic_reg.predict(X_test_CV)
test_accuracy_score = accuracy_score(y_test, y_test_predict)

print(f'The accuracy score for train set is {round(train_accuracy_score, 3)}')
print(f'The accuracy score for test set is {round(test_accuracy_score, 3)}')

The accuracy score for train set is 0.966
The accuracy score for test set is 0.877
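
Accuracy alone does not say which kind of mistake the model makes. A confusion matrix splits the errors into false positives and false negatives, which is useful for the bonus step. The sketch below uses short hand-written label lists (`actual` and `predicted` are hypothetical, not the notebook's results).

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, chosen only to illustrate the layout of the matrix.
actual = ['negative', 'negative', 'positive', 'positive', 'positive']
predicted = ['negative', 'positive', 'positive', 'negative', 'positive']

# Rows are actual classes, columns are predicted classes, in the order given by labels.
cm = confusion_matrix(actual, predicted, labels=['negative', 'positive'])
tn, fp, fn, tp = cm.ravel()
print(f'false positives: {fp}, false negatives: {fn}')
```

On the real data the same call would presumably take `y_test` and `y_test_predict`.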


### Worst 10 predictions

In [None]:
logistic_reg.classes_

array(['negative', 'positive'], dtype=object)
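
The order of `classes_` matters because the columns of `predict_proba` follow it. A small sketch on made-up one-feature data (`X_toy`, `y_toy` and `toy_model` are hypothetical) shows how to look the column up by label rather than hard-coding index 0 or 1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: low values are 'negative', high values are 'positive'.
X_toy = np.array([[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]])
y_toy = np.array(['negative'] * 3 + ['positive'] * 3)

toy_model = LogisticRegression().fit(X_toy, y_toy)
probs = toy_model.predict_proba(np.array([[9.5]]))

# Find the column that belongs to the 'positive' label.
pos_col = list(toy_model.classes_).index('positive')
print(toy_model.classes_, probs[0, pos_col])
```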

In [None]:
y_train_predict_probs = logistic_reg.predict_proba(X_train_CV)
y_test_predict_probs = logistic_reg.predict_proba(X_test_CV)

In [None]:
train_results_df = pd.DataFrame({
    'text': X_train,
    'actual': y_train,
    'predicted': y_train_predict,
    'neg_prob': y_train_predict_probs[:, 0],
    'pos_prob': y_train_predict_probs[:, 1]
})

test_results_df = pd.DataFrame({
    'text': X_test,
    'actual': y_test,
    'predicted': y_test_predict,
    'neg_prob': y_test_predict_probs[:, 0],
    'pos_prob': y_test_predict_probs[:, 1]
})
results_df = pd.concat([train_results_df, test_results_df], ignore_index=True)

# .copy() gives an independent frame, so adding a column raises no SettingWithCopyWarning.
false_results_df = results_df[results_df.actual != results_df.predicted].copy()
# For a misclassified row, the larger probability is the one assigned to the wrong label.
false_results_df['highest_failed_prob'] = false_results_df[['neg_prob', 'pos_prob']].max(axis=1)

pd.set_option('display.max_colwidth', None)
false_results_df.sort_values(by='highest_failed_prob', ascending=False).head(10)


Unnamed: 0,text,actual,predicted,neg_prob,pos_prob,highest_failed_prob
12346,have an amazing time with your mommas tomorrow! Show them how much they mean to you Whatever you do they will love it,negative,positive,9.7e-05,0.999903,0.999903
13585,I will not be late. I will not be late. I will not be late.,positive,negative,0.999782,0.000218,0.999782
12348,hoping i didn`t fail english. that would just be sad,positive,negative,0.995522,0.004478,0.995522
9007,it was fun Ate at mas.Thanks!! Good luck w/ ur showing tomorrow.,negative,positive,0.004569,0.995431,0.995431
15886,"Great, just great. #Cookoutofthecentury and my wife`s tummy hurts. Just. Great.",negative,positive,0.005116,0.994884,0.994884
15066,"alas, I am moving (like where i`m moving too, but the actual moving, ugh) wish I could go too!",positive,negative,0.994077,0.005923,0.994077
12594,yeah back to work I get out at 3:30 so it`s not that bad,positive,negative,0.99261,0.00739,0.99261
4909,on that note - i do not feel missed.,positive,negative,0.992185,0.007815,0.992185
15400,Thanks! My mom`s seed is larger and already cracked (and planted). I hope Avalina isn`t a dud!,negative,positive,0.009614,0.990386,0.990386
10572,Got the sniffles I SO don`t want to get sick - I don`t need this.,positive,negative,0.986676,0.013324,0.986676


These are the 10 worst predictions: the misclassified tweets where the model assigned the highest probability to the wrong label. For several of them, I would dispute the actual label itself. But Twitter (sorry, X) users are a strange bunch in general. That's what happens when you let people post their stream of consciousness.