# TF-IDF

---

**Datasets:**
1. Dataset with apostrophe in regex and lemmatization
    - sentiment140/sentiment140_lem_all_with_apostrophe.csv
2. Dataset without apostrophe in regex and lemmatization
    - => some tweets contain fewer tokens than in the dataset with apostrophe
    - sentiment140/sentiment140_lem_all_no_apostrophe.csv
3. Dataset with apostrophe in regex and stemming
    - sentiment140/sentiment140_stem_all_with_apostrophe.csv
4. Dataset without apostrophe in regex and stemming
    - => some tweets contain fewer tokens than in the dataset with apostrophe
    - sentiment140/sentiment140_stem_all_no_apostrophe.csv

---
Settings: MAX_FEATURES = 5000, NGRAM_RANGE = (1,1), 5% of each dataset

**GridSearch best parameters** (numbered as in the dataset list above):
1. {'metric': 'minkowski', 'n_neighbors': 5, 'p': 2, 'weights': 'distance'}
2. {'metric': 'minkowski', 'n_neighbors': 1, 'p': 2, 'weights': 'uniform'}
3. {'metric': 'minkowski', 'n_neighbors': 1, 'p': 2, 'weights': 'uniform'}
4. {'metric': 'minkowski', 'n_neighbors': 1, 'p': 2, 'weights': 'uniform'}
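Every winning configuration uses `'metric': 'minkowski'` with `'p': 2`, which is the Euclidean distance; `p = 1` (the other value in the grid) is the Manhattan distance. A minimal sketch of the formula in plain Python follows; the helper name `minkowski_distance` and the toy vectors are illustrative only and are not part of the notebook.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


u = [0.0, 3.0]
v = [4.0, 0.0]

print(minkowski_distance(u, v, p=1))  # Manhattan: 3 + 4
print(minkowski_distance(u, v, p=2))  # Euclidean: sqrt(9 + 16)
```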



**Mean 5-fold cross-validation accuracy (`grid.best_score_`) for the best parameters from GridSearch**:

|  | with apostrophe | without apostrophe |
|--|--|--|
| Lem | 0.6134 | 0.6223 |
| Stem | 0.6159 | 0.6196 |

---

Dataset 1:

**NGRAM_RANGE = (1,2)**:  
**5% + MAX_FEATURES = 5000**
- {'metric': 'minkowski', 'n_neighbors': 1, 'p': 2, 'weights': 'uniform'}
    - test accuracy 0.61975
    - train accuracy 0.985046875

- {'metric': 'minkowski', 'n_neighbors': 5, 'p': 2, 'weights': 'distance'}
    - test accuracy 0.6290625
    - train accuracy 0.986953125

**5% + MAX_FEATURES = 18000**
- {'metric': 'minkowski', 'n_neighbors': 5, 'p': 2, 'weights': 'distance'}
    - test accuracy 0.60025
    - train accuracy 0.9915

**10% + MAX_FEATURES = 18000**
- {'metric': 'minkowski', 'n_neighbors': 5, 'p': 2, 'weights': 'distance'}
    - test accuracy 0.62409375
    - train accuracy 0.9905
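With an n-gram range of (1,2), the feature set contains every single token plus every pair of adjacent tokens, so unigrams and bigrams compete for the same MAX_FEATURES slots. The sketch below only illustrates what such a range produces for one tweet; `word_ngrams` is a hypothetical helper, not the vectorizer's own implementation.

```python
def word_ngrams(tokens, ngram_range):
    """Return all word n-grams whose length lies in ngram_range (inclusive)."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


tokens = ["whole", "body", "feel", "itchy"]

unigrams = word_ngrams(tokens, (1, 1))
uni_and_bigrams = word_ngrams(tokens, (1, 2))

print(unigrams)
print(uni_and_bigrams)
```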

---

Goal: train on the full dataset with the best parameters from GridSearch and
- MAX_FEATURES = unlimited => fails with a MemoryError
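To get a feeling for the memory involved, the following back-of-the-envelope sketch estimates the size of a dense float64 copy of a TF-IDF matrix (for example one produced by `.toarray()`). It does not diagnose where the MemoryError actually came from. The 1,280,000-row case is a hypothetical 80% training split of 1,600,000 rows, and `dense_gib` is an illustrative helper.

```python
def dense_gib(n_rows, n_features, bytes_per_value=8):
    """Size in GiB of a dense matrix with 8-byte (float64) entries by default."""
    return n_rows * n_features * bytes_per_value / 2**30


# training matrix of the 10% run: 128000 rows x 18000 features
ten_percent = round(dense_gib(128_000, 18_000), 1)

# hypothetical full-dataset training split with the same feature limit
full_dataset = round(dense_gib(1_280_000, 18_000), 1)

print(ten_percent, "GiB")
print(full_dataset, "GiB")
```

A sparse matrix stores only the non-zero entries, which is usually a small fraction of these figures for short tweets.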


In [42]:
# Imports

import pandas as pd
import numpy as np
import sklearn.metrics as metrics
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

pd.options.display.max_colwidth = None

## 1. Dataset with apostrophe in regex and lemmatization

In [43]:
DATA_NAME = 'sentiment140/sentiment140_lem_all_with_apostrophe.csv'
ENCODING = 'latin-1'
COLUMN_NAMES = ['sentiment', 'tweet']
NROWS = 1600000

df_with_apostrophe = pd.read_csv(DATA_NAME,
                 encoding=ENCODING,
                 header=None,
                 names=COLUMN_NAMES,
                 nrows=NROWS)

df_with_apostrophe.head(5)

Unnamed: 0,sentiment,tweet
0,0,aww that 's bummer shoulda got david carr third day
1,0,upset ca n't update facebook texting might cry result school today also blah
2,0,dived many time ball managed save 50 rest go bound
3,0,whole body feel itchy like fire
4,0,behaving i 'm mad ca n't see


In [49]:
# create smaller datasets: 160,000-row (10%) and 80,000-row (5%) windows centered on the middle row
middle = df_with_apostrophe.shape[0] // 2
df_with_apostrophe_10 = df_with_apostrophe.iloc[middle - 80000:middle + 80000]
df_with_apostrophe_5 = df_with_apostrophe.iloc[middle - 40000:middle + 40000]

print('10% df', df_with_apostrophe_10.sentiment.value_counts())
print('5% df', df_with_apostrophe_5.sentiment.value_counts())

10% df sentiment
0    80067
1    79933
Name: count, dtype: int64
5% df sentiment
0    40067
1    39933
Name: count, dtype: int64
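The near-balanced counts above are consistent with a file whose rows are ordered by sentiment, so that a window around the middle row picks up roughly equal numbers of both classes. The notebook does not state this ordering, so treat it as an assumption. A toy sketch of the idea, with a made-up label list and a hypothetical helper `middle_window`:

```python
from collections import Counter


def middle_window(rows, half_width):
    """Return the 2 * half_width rows centered on the middle of the sequence."""
    mid = len(rows) // 2
    return rows[mid - half_width:mid + half_width]


# toy labels, sorted by class as assumed for the real file
labels = [0] * 10 + [1] * 10

window_counts = Counter(middle_window(labels, 3))
print(window_counts)
```

If the class boundary is not exactly at the middle row, the window becomes slightly unbalanced, which would explain counts such as 80067 versus 79933.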


In [50]:
USED_DF = df_with_apostrophe_10
MAX_FEATURES = 18000
NGRAM_RANGE = (1,2)
TEST_SIZE = 0.2
N_NEIGHBORS = 5
METRIC = 'minkowski'
P_VALUE = 2
WEIGHTS = 'distance'

In [51]:
X_train, X_test, y_train, y_test = train_test_split(
                                    USED_DF.tweet,
                                    USED_DF.sentiment,
                                    test_size=TEST_SIZE,
                                    random_state=42
                                    )

In [52]:
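# TF-IDF model
# NOTE: the vectorizer is fitted on the whole subset (train + test), as in the recorded run;
# fitting on X_train only would keep test tweets out of the vocabulary and the IDF weights.
tfidf_vectorizer = TfidfVectorizer(max_features=MAX_FEATURES, ngram_range=NGRAM_RANGE)
tfivectorizer = tfidf_vectorizer.fit(USED_DF.tweet)

X_train_tfidf = tfivectorizer.transform(X_train)
X_test_tfidf = tfivectorizer.transform(X_test)

# sparse matrices expose .shape directly; .toarray() would allocate a full dense copy
print("X_train_shape", X_train_tfidf.shape)
print("X_test_shape", X_test_tfidf.shape)
print("y_train_shape", y_train.shape)
print("y_test_shape", y_test.shape)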

X_train_shape (128000, 18000)
X_test_shape (32000, 18000)
y_train_shape (128000,)
y_test_shape (32000,)
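For reference, the weighting that `TfidfVectorizer` applies with its default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2-normalized rows) can be sketched in plain Python. This is a simplified illustration: it skips tokenization and the MAX_FEATURES cut-off, and the function name `tfidf` and the toy documents are not from the notebook.

```python
import math
from collections import Counter


def tfidf(docs):
    """TF-IDF rows (term -> weight) with smoothed IDF and L2 normalization."""
    n_docs = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in df}

    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()} if norm else {})
    return rows


docs = [["feel", "itchy", "like", "fire"], ["feel", "mad"]]
rows = tfidf(docs)

for row in rows:
    print(row)
```

In the second document, "mad" ends up with a larger weight than "feel" because "feel" occurs in both documents and therefore has the lower IDF.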


In [53]:
clf_knn = KNeighborsClassifier(n_neighbors=N_NEIGHBORS, metric=METRIC, p=P_VALUE, weights=WEIGHTS).fit(X_train_tfidf, y_train)

pred_X_train = clf_knn.predict(X_train_tfidf)
pred_X_test = clf_knn.predict(X_test_tfidf)

print('classification report test\n', metrics.classification_report(y_test, pred_X_test))
print('test accuracy', metrics.accuracy_score(y_test, pred_X_test))
print('train accuracy', metrics.accuracy_score(y_train, pred_X_train))

classification report test
               precision    recall  f1-score   support

           0       0.65      0.51      0.58     15891
           1       0.60      0.73      0.66     16109

    accuracy                           0.62     32000
   macro avg       0.63      0.62      0.62     32000
weighted avg       0.63      0.62      0.62     32000

test accuracy 0.62409375
train accuracy 0.9905
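One plausible reading of the gap between train and test accuracy (an interpretation, not something this notebook verifies): when the classifier predicts a training tweet, that tweet is its own nearest neighbour at distance zero, so the prediction mostly reproduces the stored label. Under that reading, the remaining training errors would come from tweets with identical vectors but different labels. A toy 1-nearest-neighbour sketch with made-up data:

```python
def predict_1nn(train_X, train_y, x):
    """Label of the training point closest to x (first one wins on ties)."""
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    nearest = min(range(len(train_X)), key=lambda i: squared_distance(train_X[i], x))
    return train_y[nearest]


# the last two rows have identical features but conflicting labels
train_X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 1.0]]
train_y = [0, 1, 0, 1]

train_pred = [predict_1nn(train_X, train_y, x) for x in train_X]
train_accuracy = sum(p == y for p, y in zip(train_pred, train_y)) / len(train_y)

print(train_pred)
print(train_accuracy)
```

Train accuracy is therefore a poor indicator of generalization for KNN; the test and cross-validation scores are the meaningful ones.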


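#### GridSearch for best KNN parameters

Note: the cells in this subsection (In [45], In [46]) were executed before the configuration above was switched to the 10% subset. Their output matches the 5% subset with MAX_FEATURES = 5000 and NGRAM_RANGE = (1,1): the test report covers 16000 rows and the best score is the 0.6134 listed in the summary.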

In [45]:
PARAM_GRID = {'n_neighbors': [1, 5, 9, 13],
              'p': [1, 2],
              'weights': ('uniform', 'distance'),
              'metric': ['minkowski']
              }


grid = GridSearchCV(estimator=KNeighborsClassifier(),
                    param_grid=PARAM_GRID,
                    scoring='accuracy',
                    n_jobs=-1,
                    refit=True,
                    cv=5,
                    verbose=4
                    )

grid.fit(X_train_tfidf, y_train)

# Results
print(grid.best_params_)
print(grid.best_score_)

Fitting 5 folds for each of 16 candidates, totalling 80 fits
{'metric': 'minkowski', 'n_neighbors': 5, 'p': 2, 'weights': 'distance'}
0.6134062499999999
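The "16 candidates, totalling 80 fits" line follows directly from the grid: 4 values for `n_neighbors`, 2 for `p`, 2 for `weights` and 1 for `metric` give 16 combinations, each fitted once per fold with `cv=5`. A small sketch that enumerates the combinations with `itertools.product` (this mirrors the counting only, not how GridSearchCV schedules its work):

```python
from itertools import product

PARAM_GRID = {'n_neighbors': [1, 5, 9, 13],
              'p': [1, 2],
              'weights': ('uniform', 'distance'),
              'metric': ['minkowski']
              }
CV_FOLDS = 5

names = list(PARAM_GRID)
candidates = [dict(zip(names, values))
              for values in product(*(PARAM_GRID[name] for name in names))]

print(len(candidates))             # number of parameter combinations
print(len(candidates) * CV_FOLDS)  # number of model fits during the search
```

With `refit=True`, one additional fit on the whole training set follows the search.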


In [46]:
# predict X_test for the best parameters
pred_grid = grid.predict(X_test_tfidf)

print('classification report grid test\n', metrics.classification_report(y_test, pred_grid))

classification report grid test
               precision    recall  f1-score   support

           0       0.64      0.54      0.58      7991
           1       0.60      0.70      0.65      8009

    accuracy                           0.62     16000
   macro avg       0.62      0.62      0.62     16000
weighted avg       0.62      0.62      0.62     16000
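As a reminder of how the columns of the report relate to each other, the sketch below computes precision and recall from label lists and F1 as their harmonic mean. The helper names and the toy labels are illustrative. The last line recomputes F1 for class 1 from the rounded precision and recall shown above; because those inputs are rounded, such a check can differ from the report in the last digit for other rows.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class, computed from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# toy labels, not taken from the dataset
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]

precision_1, recall_1 = precision_recall(y_true, y_pred, positive=1)
print(precision_1, recall_1, f1_from(precision_1, recall_1))

# class 1 in the report above: precision 0.60, recall 0.70
print(round(f1_from(0.60, 0.70), 2))
```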



## 2. Dataset without apostrophe in regex and lemmatization

In [13]:
DATA_NAME = 'sentiment140/sentiment140_lem_all_no_apostrophe.csv'
ENCODING = 'latin-1'
COLUMN_NAMES = ['sentiment', 'tweet']
NROWS = 1600000

df_no_apostrophe = pd.read_csv(DATA_NAME,
                                encoding=ENCODING,
                                header=None,
                                names=COLUMN_NAMES,
                                nrows=NROWS
                                )

df_no_apostrophe.head(5)

Unnamed: 0,sentiment,tweet
0,0,"['aww', 'bummer', 'shoulda', 'got', 'david', 'carr', 'third', 'day']"
1,0,"['upset', 'update', 'facebook', 'texting', 'might', 'cry', 'result', 'school', 'today', 'also', 'blah']"
2,0,"['dived', 'many', 'time', 'ball', 'managed', 'save', '50', 'rest', 'go', 'bound']"
3,0,"['whole', 'body', 'feel', 'itchy', 'like', 'fire']"
4,0,"['behaving', 'mad', 'see']"


In [14]:
# create smaller datasets: 160,000-row (10%) and 80,000-row (5%) windows centered on the middle row
middle = df_no_apostrophe.shape[0] // 2
df_no_apostrophe_10 = df_no_apostrophe.iloc[middle - 80000:middle + 80000]
df_no_apostrophe_5 = df_no_apostrophe.iloc[middle - 40000:middle + 40000]

print('10% df', df_no_apostrophe_10.sentiment.value_counts())
print('5% df', df_no_apostrophe_5.sentiment.value_counts())

10% df sentiment
0    80081
1    79919
Name: count, dtype: int64
5% df sentiment
0    40081
1    39919
Name: count, dtype: int64


In [15]:
USED_DF = df_no_apostrophe_5
MAX_FEATURES = 5000
NGRAM_RANGE = (1,1)
TEST_SIZE = 0.2
N_NEIGHBORS = 1

In [16]:
X_train, X_test, y_train, y_test = train_test_split(
                                    USED_DF.tweet,
                                    USED_DF.sentiment,
                                    test_size=TEST_SIZE,
                                    random_state=42
                                    )

In [17]:
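# TF-IDF model
# NOTE: the vectorizer is fitted on the whole subset (train + test), as in the recorded run;
# fitting on X_train only would keep test tweets out of the vocabulary and the IDF weights.
tfidf_vectorizer = TfidfVectorizer(max_features=MAX_FEATURES, ngram_range=NGRAM_RANGE)
tfivectorizer = tfidf_vectorizer.fit(USED_DF.tweet)

X_train_tfidf = tfivectorizer.transform(X_train)
X_test_tfidf = tfivectorizer.transform(X_test)

# sparse matrices expose .shape directly; .toarray() would allocate a full dense copy
print("X_train_shape", X_train_tfidf.shape)
print("X_test_shape", X_test_tfidf.shape)
print("y_train_shape", y_train.shape)
print("y_test_shape", y_test.shape)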

X_train_shape (64000, 5000)
X_test_shape (16000, 5000)
y_train_shape (64000,)
y_test_shape (16000,)


In [18]:
clf_knn = KNeighborsClassifier(n_neighbors=N_NEIGHBORS).fit(X_train_tfidf, y_train)

pred_X_train = clf_knn.predict(X_train_tfidf)
pred_X_test = clf_knn.predict(X_test_tfidf)

print('classification report test\n', metrics.classification_report(y_test, pred_X_test))
print('test accuracy', metrics.accuracy_score(y_test, pred_X_test))
print('train accuracy', metrics.accuracy_score(y_train, pred_X_train))

classification report test
               precision    recall  f1-score   support

           0       0.60      0.79      0.68      7992
           1       0.69      0.46      0.56      8008

    accuracy                           0.63     16000
   macro avg       0.65      0.63      0.62     16000
weighted avg       0.65      0.63      0.62     16000

test accuracy 0.6291875
train accuracy 0.985890625


#### GridSearch for best KNN parameters

In [19]:
PARAM_GRID = {'n_neighbors': [1, 5, 9, 13],
              'p': [1, 2],
              'weights': ('uniform', 'distance'),
              'metric': ['minkowski']
              }


grid = GridSearchCV(estimator=KNeighborsClassifier(),
                    param_grid=PARAM_GRID,
                    scoring='accuracy',
                    n_jobs=-1,
                    refit=True,
                    cv=5,
                    verbose=4
                    )

grid.fit(X_train_tfidf, y_train)

# results
print(grid.best_params_)
print(grid.best_score_)

Fitting 5 folds for each of 16 candidates, totalling 80 fits
{'metric': 'minkowski', 'n_neighbors': 1, 'p': 2, 'weights': 'uniform'}
0.6222656249999999
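The grid compares two voting schemes: with `'uniform'` every neighbour counts equally, with `'distance'` each vote is weighted by the inverse of its distance. For `n_neighbors = 1`, as selected here, both schemes give the same prediction because only one neighbour votes. A simplified sketch of the two schemes, assuming all distances are greater than zero; `knn_vote` and the neighbour list are made up for illustration:

```python
from collections import defaultdict


def knn_vote(neighbours, weights="uniform"):
    """Predict a label from (distance, label) pairs of the k nearest points."""
    scores = defaultdict(float)
    for distance, label in neighbours:
        # assumption: distance > 0, so the inverse is defined
        scores[label] += 1.0 if weights == "uniform" else 1.0 / distance
    return max(scores, key=scores.get)


neighbours = [(0.2, 1), (0.9, 0), (1.0, 0)]

uniform_label = knn_vote(neighbours, "uniform")    # two votes for 0, one for 1
distance_label = knn_vote(neighbours, "distance")  # 1/0.2 outweighs 1/0.9 + 1/1.0

print(uniform_label, distance_label)
```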


In [20]:
# predict X_test for the best parameters
pred_grid = grid.predict(X_test_tfidf)

print('classification report grid test\n', metrics.classification_report(y_test, pred_grid))

classification report grid test
               precision    recall  f1-score   support

           0       0.60      0.79      0.68      7992
           1       0.69      0.46      0.56      8008

    accuracy                           0.63     16000
   macro avg       0.65      0.63      0.62     16000
weighted avg       0.65      0.63      0.62     16000



## 3. Dataset with apostrophe in regex and stemming

In [21]:
DATA_NAME = 'sentiment140/sentiment140_stem_all_with_apostrophe.csv'
ENCODING = 'latin-1'
COLUMN_NAMES = ['sentiment', 'tweet']
NROWS = 1600000

df_with_apostrophe_stem = pd.read_csv(DATA_NAME,
                 encoding=ENCODING,
                 header=None,
                 names=COLUMN_NAMES,
                 nrows=NROWS)

df_with_apostrophe_stem.head(5)

Unnamed: 0,sentiment,tweet
0,0,"['aww', 'that', ""'s"", 'bummer', 'shoulda', 'got', 'david', 'carr', 'third', 'day']"
1,0,"['upset', 'ca', ""n't"", 'updat', 'facebook', 'text', 'might', 'cri', 'result', 'school', 'today', 'also', 'blah']"
2,0,"['dive', 'mani', 'time', 'ball', 'manag', 'save', '50', 'rest', 'go', 'bound']"
3,0,"['whole', 'bodi', 'feel', 'itchi', 'like', 'fire']"
4,0,"['behav', 'i', ""'m"", 'mad', 'ca', ""n't"", 'see']"
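The tweets in this file are stored as stringified token lists, and the notebook passes them to `TfidfVectorizer` without a custom tokenizer. The vectorizer's documented defaults lowercase the text and extract tokens with the pattern `(?u)\b\w\w+\b`, that is, runs of at least two word characters. The sketch below applies that pattern with `re` to row 4 of the stemmed datasets with and without apostrophe; `default_tokens` is an illustrative helper, not a scikit-learn function.

```python
import re

# default token_pattern of TfidfVectorizer
TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def default_tokens(text):
    """Lowercase the text and keep runs of two or more word characters."""
    return re.findall(TOKEN_PATTERN, text.lower())


with_apostrophe = "['behav', 'i', \"'m\", 'mad', 'ca', \"n't\", 'see']"
without_apostrophe = "['behav', 'mad', 'see']"

tokens_with = default_tokens(with_apostrophe)
tokens_without = default_tokens(without_apostrophe)

print(tokens_with)
print(tokens_without)
```

Brackets, quotes and the fragments `i`, `'m` and `n't` never reach the vocabulary, so for this row the two preprocessing variants differ only by the token `ca`. This may be part of why the four datasets score so similarly.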


In [22]:
# create smaller datasets: 160,000-row (10%) and 80,000-row (5%) windows centered on the middle row
middle = df_with_apostrophe_stem.shape[0] // 2
df_with_apostrophe_stem_10 = df_with_apostrophe_stem.iloc[middle - 80000:middle + 80000]
df_with_apostrophe_stem_5 = df_with_apostrophe_stem.iloc[middle - 40000:middle + 40000]

print('10% df', df_with_apostrophe_stem_10.sentiment.value_counts())
print('5% df', df_with_apostrophe_stem_5.sentiment.value_counts())

10% df sentiment
0    80067
1    79933
Name: count, dtype: int64
5% df sentiment
0    40067
1    39933
Name: count, dtype: int64


In [23]:
USED_DF = df_with_apostrophe_stem_5
MAX_FEATURES = 5000
NGRAM_RANGE = (1,1)
TEST_SIZE = 0.2
N_NEIGHBORS = 1

In [24]:
X_train, X_test, y_train, y_test = train_test_split(
                                    USED_DF.tweet,
                                    USED_DF.sentiment,
                                    test_size=TEST_SIZE,
                                    random_state=42
                                    )

In [25]:
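# TF-IDF model
# NOTE: the vectorizer is fitted on the whole subset (train + test), as in the recorded run;
# fitting on X_train only would keep test tweets out of the vocabulary and the IDF weights.
tfidf_vectorizer = TfidfVectorizer(max_features=MAX_FEATURES, ngram_range=NGRAM_RANGE)
tfivectorizer = tfidf_vectorizer.fit(USED_DF.tweet)

X_train_tfidf = tfivectorizer.transform(X_train)
X_test_tfidf = tfivectorizer.transform(X_test)

# sparse matrices expose .shape directly; .toarray() would allocate a full dense copy
print("X_train_shape", X_train_tfidf.shape)
print("X_test_shape", X_test_tfidf.shape)
print("y_train_shape", y_train.shape)
print("y_test_shape", y_test.shape)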

X_train_shape (64000, 5000)
X_test_shape (16000, 5000)
y_train_shape (64000,)
y_test_shape (16000,)


In [26]:
clf_knn = KNeighborsClassifier(n_neighbors=N_NEIGHBORS).fit(X_train_tfidf, y_train)

pred_X_train = clf_knn.predict(X_train_tfidf)
pred_X_test = clf_knn.predict(X_test_tfidf)

print('classification report test\n', metrics.classification_report(y_test, pred_X_test))
print('test accuracy', metrics.accuracy_score(y_test, pred_X_test))
print('train accuracy', metrics.accuracy_score(y_train, pred_X_train))

classification report test
               precision    recall  f1-score   support

           0       0.59      0.80      0.68      7991
           1       0.69      0.44      0.54      8009

    accuracy                           0.62     16000
   macro avg       0.64      0.62      0.61     16000
weighted avg       0.64      0.62      0.61     16000

test accuracy 0.6230625
train accuracy 0.988078125


#### GridSearch for best KNN parameters

In [27]:
PARAM_GRID = {'n_neighbors': [1, 5, 9, 13],
              'p': [1, 2],
              'weights': ('uniform', 'distance'),
              'metric': ['minkowski']
              }


grid = GridSearchCV(estimator=KNeighborsClassifier(),
                    param_grid=PARAM_GRID,
                    scoring='accuracy',
                    n_jobs=-1,
                    refit=True,
                    cv=5,
                    verbose=4
                    )

grid.fit(X_train_tfidf, y_train)

# Results
print(grid.best_params_)
print(grid.best_score_)

Fitting 5 folds for each of 16 candidates, totalling 80 fits
{'metric': 'minkowski', 'n_neighbors': 1, 'p': 2, 'weights': 'uniform'}
0.615859375


In [28]:
# predict X_test for the best parameters
pred_grid = grid.predict(X_test_tfidf)

print('classification report grid test\n', metrics.classification_report(y_test, pred_grid))

classification report grid test
               precision    recall  f1-score   support

           0       0.59      0.80      0.68      7991
           1       0.69      0.44      0.54      8009

    accuracy                           0.62     16000
   macro avg       0.64      0.62      0.61     16000
weighted avg       0.64      0.62      0.61     16000



## 4. Dataset without apostrophe in regex and stemming

In [29]:
DATA_NAME = 'sentiment140/sentiment140_stem_all_no_apostrophe.csv'
ENCODING = 'latin-1'
COLUMN_NAMES = ['sentiment', 'tweet']
NROWS = 1600000

df_no_apostrophe_stem = pd.read_csv(DATA_NAME,
                 encoding=ENCODING,
                 header=None,
                 names=COLUMN_NAMES,
                 nrows=NROWS)

df_no_apostrophe_stem.head(5)

Unnamed: 0,sentiment,tweet
0,0,"['aww', 'bummer', 'shoulda', 'got', 'david', 'carr', 'third', 'day']"
1,0,"['upset', 'updat', 'facebook', 'text', 'might', 'cri', 'result', 'school', 'today', 'also', 'blah']"
2,0,"['dive', 'mani', 'time', 'ball', 'manag', 'save', '50', 'rest', 'go', 'bound']"
3,0,"['whole', 'bodi', 'feel', 'itchi', 'like', 'fire']"
4,0,"['behav', 'mad', 'see']"


In [30]:
# create smaller datasets: 160,000-row (10%) and 80,000-row (5%) windows centered on the middle row
middle = df_no_apostrophe_stem.shape[0] // 2
df_no_apostrophe_stem_10 = df_no_apostrophe_stem.iloc[middle - 80000:middle + 80000]
df_no_apostrophe_stem_5 = df_no_apostrophe_stem.iloc[middle - 40000:middle + 40000]

print('10% df', df_no_apostrophe_stem_10.sentiment.value_counts())
print('5% df', df_no_apostrophe_stem_5.sentiment.value_counts())

10% df sentiment
0    80081
1    79919
Name: count, dtype: int64
5% df sentiment
0    40081
1    39919
Name: count, dtype: int64


In [31]:
USED_DF = df_no_apostrophe_stem_5
MAX_FEATURES = 5000
NGRAM_RANGE = (1,1)
TEST_SIZE = 0.2
N_NEIGHBORS = 1

In [32]:
X_train, X_test, y_train, y_test = train_test_split(
                                    USED_DF.tweet,
                                    USED_DF.sentiment,
                                    test_size=TEST_SIZE,
                                    random_state=42
                                    )

In [33]:
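# TF-IDF model
# NOTE: the vectorizer is fitted on the whole subset (train + test), as in the recorded run;
# fitting on X_train only would keep test tweets out of the vocabulary and the IDF weights.
tfidf_vectorizer = TfidfVectorizer(max_features=MAX_FEATURES, ngram_range=NGRAM_RANGE)
tfivectorizer = tfidf_vectorizer.fit(USED_DF.tweet)

X_train_tfidf = tfivectorizer.transform(X_train)
X_test_tfidf = tfivectorizer.transform(X_test)

# sparse matrices expose .shape directly; .toarray() would allocate a full dense copy
print("X_train_shape", X_train_tfidf.shape)
print("X_test_shape", X_test_tfidf.shape)
print("y_train_shape", y_train.shape)
print("y_test_shape", y_test.shape)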

X_train_shape (64000, 5000)
X_test_shape (16000, 5000)
y_train_shape (64000,)
y_test_shape (16000,)


In [34]:
clf_knn = KNeighborsClassifier(n_neighbors=N_NEIGHBORS).fit(X_train_tfidf, y_train)

pred_X_train = clf_knn.predict(X_train_tfidf)
pred_X_test = clf_knn.predict(X_test_tfidf)

print('classification report test\n', metrics.classification_report(y_test, pred_X_test))
print('test accuracy', metrics.accuracy_score(y_test, pred_X_test))
print('train accuracy', metrics.accuracy_score(y_train, pred_X_train))

classification report test
               precision    recall  f1-score   support

           0       0.59      0.79      0.68      7992
           1       0.68      0.46      0.55      8008

    accuracy                           0.62     16000
   macro avg       0.64      0.62      0.61     16000
weighted avg       0.64      0.62      0.61     16000

test accuracy 0.622375
train accuracy 0.98734375


#### GridSearch for best KNN parameters

In [35]:
PARAM_GRID = {'n_neighbors': [1, 5, 9, 13],
              'p': [1, 2],
              'weights': ('uniform', 'distance'),
              'metric': ['minkowski']
              }


grid = GridSearchCV(estimator=KNeighborsClassifier(),
                    param_grid=PARAM_GRID,
                    scoring='accuracy',
                    n_jobs=-1,
                    refit=True,
                    cv=5,
                    verbose=4
                    )

grid.fit(X_train_tfidf, y_train)

# Results
print(grid.best_params_)
print(grid.best_score_)

Fitting 5 folds for each of 16 candidates, totalling 80 fits
{'metric': 'minkowski', 'n_neighbors': 1, 'p': 2, 'weights': 'uniform'}
0.61959375


In [36]:
# predict X_test for the best parameters
pred_grid = grid.predict(X_test_tfidf)

print('classification report grid test\n', metrics.classification_report(y_test, pred_grid))

classification report grid test
               precision    recall  f1-score   support

           0       0.59      0.79      0.68      7992
           1       0.68      0.46      0.55      8008

    accuracy                           0.62     16000
   macro avg       0.64      0.62      0.61     16000
weighted avg       0.64      0.62      0.61     16000

