### Task

1. Import the Twitter dataset of tweets into a DataFrame.
2. Keep only positive and negative tweets (that is, exclude neutral ones). What is the percentage of positive and negative tweets?
3. Copy the text column into a Series X and the sentiment column into a Series y. Apply a train/test split with random_state=32.
4. Create a vectorizer with scikit-learn's TfidfVectorizer. Fit it on X_train and build the feature matrix X_train_CV. Then create the X_test_CV matrix without re-fitting the vectorizer. X_test_CV should have shape 4091x15806 with 44633 stored elements.
5. Now train a logistic regression with default parameters. You should get these accuracy scores: 0.932 for the train set and 0.873 for the test set.



### Imports

In [None]:
import pandas as pd
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [None]:
nltk.download('popular')

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cmudict.zip.
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gazetteers.zip.
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/genesis.zip.
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gutenberg.zip.
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/inaugural.zip.
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/movie_reviews.zip.
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/names.zip.
[nltk_data]    | Downloading package shakespeare to /root/nltk_data...
[nlt

True

### Import data

In [None]:
url = 'https://raw.githubusercontent.com/DaPlayfulQueen/DE_track_data/master/tweets.csv'
tweets = pd.read_csv(url)
print(f'Original tweet count is {tweets.shape[0]}')
tweets.head()

Original tweet count is 27480


Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative


### Remove neutrals

In [None]:
tweets = tweets[tweets.sentiment != 'neutral']
print(f'Neg/pos tweet count is {tweets.shape[0]}')

Neg/pos tweet count is 16363


In [None]:
tweets.sentiment.value_counts()

positive    8582
negative    7781
Name: sentiment, dtype: int64
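
The task also asks for the share of each class, while the cell above reports only raw counts. A minimal sketch of one way to get percentages with `value_counts(normalize=True)` follows. To stay self-contained it rebuilds a stand-in Series from the counts printed above instead of downloading the CSV; the names `sentiment_demo` and `percentages` are hypothetical and do not appear in the notebook.

```python
import pandas as pd

# Stand-in for tweets.sentiment, rebuilt from the counts shown above
# (assumption: 8582 positive and 7781 negative tweets after filtering).
sentiment_demo = pd.Series(['positive'] * 8582 + ['negative'] * 7781,
                           name='sentiment')

# normalize=True returns fractions instead of counts; scale to percent.
percentages = sentiment_demo.value_counts(normalize=True).mul(100).round(2)
print(percentages)
```

In the notebook itself the same call would presumably be applied directly to `tweets.sentiment`. With the counts above this gives roughly 52.45% positive and 47.55% negative, so the two classes are close to balanced.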

### Vectorize

In [None]:
X = tweets['text']
y = tweets['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=32,
                                                    train_size=0.75)

In [None]:
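vectorizer = TfidfVectorizer()
# Learn the vocabulary and IDF weights from the training text only
X_train_CV = vectorizer.fit_transform(X_train)
# Reuse the fitted vocabulary for the test text (no re-fitting)
X_test_CV = vectorizer.transform(X_test)
X_test_CV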

<4091x15806 sparse matrix of type '<class 'numpy.float64'>'
	with 44633 stored elements in Compressed Sparse Row format>
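
The column count of `X_test_CV` (15806) comes from the vocabulary learned on the training text, not from the test text. The toy sketch below illustrates that on a made-up three-sentence corpus: words seen only at transform time are dropped. All names and sentences here (`demo_vectorizer`, `demo_train`, `demo_test`) are hypothetical and are not part of the notebook's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, only to show fit vs. transform behaviour.
demo_train = ['good day', 'bad day']
demo_test = ['good night']

demo_vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary from the training sentences.
X_demo_train = demo_vectorizer.fit_transform(demo_train)
# transform reuses that vocabulary; 'night' was never seen, so it is ignored.
X_demo_test = demo_vectorizer.transform(demo_test)

print(sorted(demo_vectorizer.vocabulary_))  # vocabulary learned from demo_train
print(X_demo_train.shape, X_demo_test.shape)
print(X_demo_test.nnz)  # number of stored (non-zero) elements
```

Both matrices have the same number of columns, which is what lets a model trained on one be applied to the other.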

### Logistic regression

In [None]:
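# max_iter is raised from the default of 100 to give the solver room to
# converge; every other parameter keeps its default value
logistic_reg = LogisticRegression(max_iter=1000)
logistic_reg.fit(X_train_CV, y_train)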

y_train_predict = logistic_reg.predict(X_train_CV)
train_accuracy_score = accuracy_score(y_train, y_train_predict)

y_test_predict = logistic_reg.predict(X_test_CV)
test_accuracy_score = accuracy_score(y_test, y_test_predict)

print(f'The accuracy score for train set is {round(train_accuracy_score, 3)}')
print(f'The accuracy score for test set is {round(test_accuracy_score, 3)}')

The accuracy score for train set is 0.932
The accuracy score for test set is 0.873
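
As an optional variation, the vectorize and fit steps could be bundled into a single scikit-learn pipeline, which makes it harder to re-fit the vectorizer on test text by accident. The sketch below uses a tiny made-up corpus (all texts, labels, and variable names are hypothetical), so its predictions say nothing about the scores above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for X_train / y_train and X_test.
toy_train_texts = ['I love this', 'what a great day',
                   'I hate this', 'what an awful day']
toy_train_labels = ['positive', 'positive', 'negative', 'negative']
toy_test_texts = ['love the great weather', 'awful, I hate it']

# The pipeline fits the vectorizer and the classifier together on the
# training text, then applies transform + predict to new text.
sentiment_pipeline = make_pipeline(TfidfVectorizer(),
                                   LogisticRegression(max_iter=1000))
sentiment_pipeline.fit(toy_train_texts, toy_train_labels)

toy_predictions = sentiment_pipeline.predict(toy_test_texts)
print(toy_predictions)
```

On the real data, `sentiment_pipeline.fit(X_train, y_train)` followed by `sentiment_pipeline.score(X_test, y_test)` should be equivalent to the separate steps in this notebook, assuming the same split and parameters.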
