## K-Nearest Neighbour


### The Dataset
The dataset is available in the zip file and is a collection of *11099 tweets*, provided as a CSV file. The ground truth, also in the zip file, indicates whether each tweet was popular or not. Since the task involves selecting features yourself to vectorize a tweet, we suggest some data analysis of the columns you consider important.
<br><br>

### The Task
You have to build a classifier that predicts whether a tweet was popular or not. You are required to use the **KNN** algorithm to build the classifier and cannot use any inbuilt classifier. All columns should be analyzed, filtered and preprocessed to determine their importance as features in the vector for every tweet (not every column will be useful).<br>
The data contains the **raw text of the tweet** (in the text column) as well as other **metadata** such as likes count and user followers count. Note that it might be useful to **create new columns** with useful information. For example, the *number of hashtags* might be useful but is not directly present as a column.<br>
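As a minimal sketch of creating such columns, the snippet below counts hashtags and mentions straight from the raw text. The toy frame and the new column names (`n_hashtags`, `n_mentions`, `is_retweet`) are assumptions for illustration only. The text column in the real data may be cut off, so counts taken from the `entities` column could be more complete.

```python
import pandas as pd

# Toy frame standing in for Tweets.csv; only a 'text' column is assumed.
toy = pd.DataFrame({"text": [
    "RT @VenomMovie: The world has enough superheroes #Venom",
    "No tags here",
    "#one and #two tags",
]})

# Count hashtag and mention tokens with a regular expression.
toy["n_hashtags"] = toy["text"].str.count(r"#\w+")
toy["n_mentions"] = toy["text"].str.count(r"@\w+")
# Flag retweets, which start with "RT @" in this dataset's text column.
toy["is_retweet"] = toy["text"].str.startswith("RT @").astype(int)

print(toy[["n_hashtags", "n_mentions", "is_retweet"]])
```

Derived columns like these can be appended to the metadata vector before scaling.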
There are 3 main sub parts:
1. *Vectorize tweets using only metadata* - likes, user followers count, and other created data
2. *Vectorize tweets using only their text*. This segment will require NLP techniques to clean the text and extract a vector using a BoW model. Here is a useful link for the same - [Tf-Idf](https://towardsdatascience.com/text-vectorization-term-frequency-inverse-document-frequency-tfidf-5a3f9604da6d). Since these vectors will be very large, we recommend reducing their dimensionality (~10 - 25). Hint: [Dimensionality Reduction](https://jonathan-hui.medium.com/machine-learning-singular-value-decomposition-svd-principal-component-analysis-pca-1d45e885e491). Note that you are allowed to use libraries for this as well.

3. *Combine the vectors from the above two techniques to create one bigger vector*
<br>
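The text pipeline in sub part 2 could look roughly like the sketch below: TF-IDF turns each document into a sparse vector, and a truncated SVD reduces it to a few dense components. The toy sentences and the choice of `TruncatedSVD` with 3 components are assumptions for illustration. On the real data, a value in the suggested 10 to 25 range would be used, and PCA is an equally valid choice.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the cleaned tweet texts.
docs = [
    "court rules against the policy",
    "new superhero movie trailer released",
    "teachers versus students football match",
    "movie trailer breaks viewing record",
    "court hears football case",
]

# Sparse TF-IDF matrix: one row per document, one column per vocabulary term.
tfidf = TfidfVectorizer(stop_words="english")
X_sparse = tfidf.fit_transform(docs)

# Reduce to a small dense representation; TruncatedSVD accepts sparse input.
svd = TruncatedSVD(n_components=3, random_state=0)
X_small = svd.fit_transform(X_sparse)

print(X_sparse.shape, "->", X_small.shape)
```

Working on the sparse matrix avoids building the full dense TF-IDF array, which can matter when the vocabulary is large.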


Using KNN on these vectors, build a classifier to predict the popularity of the tweet and report the accuracy of each of the three methods along with an analysis. You can use sklearn's NearestNeighbors and need not write KNN from scratch (however, you cannot use the classifier directly). You are expected to try the classifier with different numbers of neighbours and identify the optimal K value.
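One way to turn neighbour lookups into class predictions is a plain majority vote, sketched below. The helper name `knn_predict` and the one-dimensional toy data are hypothetical, and the sketch assumes binary labels in {0, 1} and an odd `k` so that ties cannot occur.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_predict(X_train, y_train, X_test, k):
    """Predict 0/1 labels by majority vote among the k nearest training points."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nn.kneighbors(X_test)      # idx has shape (n_test, k)
    votes = y_train[idx]                # labels of each test point's neighbours
    # With odd k, a mean above 0.5 is a strict majority for class 1.
    return (votes.mean(axis=1) > 0.5).astype(int)

# Toy data: class 0 clusters near 0, class 1 clusters near 1.
X_train = np.array([[0.0], [0.1], [0.2], [0.9], [1.0], [1.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[0.05], [1.05]])

print(knn_predict(X_train, y_train, X_test, k=3))
```

Only `NearestNeighbors` is used from sklearn, so the classification step itself stays hand-written as the task requires.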

## Import necessary libraries

In [1]:
import pandas as pd
import numpy as np
import json
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
import re
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score


nltk.download("words")
nltk.download("stopwords")
nltk.download('punkt')

[nltk_data] Downloading package words to /home/aakash/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package stopwords to /home/aakash/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/aakash/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Load and display the data

In [3]:
# Read one 0/1 label per line from the ground-truth file
with open('ground_truth.csv') as f:
    label = [int(line[0]) for line in f]

df = pd.read_csv('Tweets.csv')
df['label'] = label

In [4]:
df.head()

Unnamed: 0,created_at,id,id_str,text,truncated,entities,metadata,source,is_quote_status,retweet_count,...,user_name,user_screen_name,user_followers_count,user_friends_count,user_listed_count,user_created_at,user_favourites_count,user_verified,user_statuses_count,label
0,Tue Jul 31 13:34:34 +0000 2018,1.02429e+18,1.02429e+18,RT @EdwardTHardy: The 7th US Circuit Court of ...,False,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'iso_language_code': 'en', 'result_type': 're...","<a href=""http://twitter.com/download/android"" ...",False,113,...,Sherry Wahl,queenfancygirl,153,264,7,Thu Mar 18 19:16:31 +0000 2010,32984,False,31308,0
1,Tue Jul 31 13:34:14 +0000 2018,1.02429e+18,1.02429e+18,RT @VenomMovie: The world has enough superhero...,False,"{'hashtags': [{'text': 'Venom', 'indices': [64...","{'iso_language_code': 'en', 'result_type': 're...","<a href=""http://twitter.com/download/android"" ...",False,5902,...,Kay Khairil ðŸŒ,ikaykhairil,780,382,12,Wed Mar 17 03:27:51 +0000 2010,6648,False,87272,0
2,Tue Jul 31 13:34:40 +0000 2018,1.02429e+18,1.02429e+18,RT @FutbolBible: Teachers vs Students match &a...,False,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'iso_language_code': 'en', 'result_type': 're...","<a href=""http://twitter.com/download/iphone"" r...",False,3745,...,Charlie Hamilton,ch100897,255,246,1,Sun Mar 03 09:23:03 +0000 2013,5426,False,1731,1
3,Tue Jul 31 13:34:27 +0000 2018,1.02429e+18,1.02429e+18,RT @mashable: Someone from 'The Office' actual...,False,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'iso_language_code': 'en', 'result_type': 're...","<a href=""http://twitter.com"" rel=""nofollow"">Tw...",False,10,...,Mike Santos,mikesantosx71,2419,2428,4,Thu May 25 14:37:29 +0000 2017,5993,False,2153,0
4,Tue Jul 31 13:34:28 +0000 2018,1.02429e+18,1.02429e+18,RT @_missj0hnson: I’m at Starbucks asking fo...,False,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'iso_language_code': 'en', 'result_type': 're...","<a href=""http://twitter.com/download/android"" ...",False,25306,...,Soots,DaAverageDingus,314,722,7,Tue Mar 15 01:14:02 +0000 2011,6285,False,33503,1


## Exploratory Data Analysis
*This is an ungraded section but is recommended to get a good grasp on the dataset*

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11099 entries, 0 to 11098
Data columns (total 22 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   created_at             11099 non-null  object 
 1   id                     11099 non-null  float64
 2   id_str                 11099 non-null  float64
 3   text                   11099 non-null  object 
 4   truncated              11099 non-null  bool   
 5   entities               11099 non-null  object 
 6   metadata               11099 non-null  object 
 7   source                 11099 non-null  object 
 8   is_quote_status        11099 non-null  bool   
 9   retweet_count          11099 non-null  int64  
 10  favorite_count         11099 non-null  int64  
 11  lang                   11099 non-null  object 
 12  user_name              11099 non-null  object 
 13  user_screen_name       11099 non-null  object 
 14  user_followers_count   11099 non-null  int64  
 15  us

## Part-1
*Vectorize tweets using only meta data*

In [9]:
def get_features(df):
    # Numeric and boolean metadata columns; .copy() avoids pandas' SettingWithCopyWarning
    train_df = df[['is_quote_status','retweet_count','favorite_count','user_followers_count','user_friends_count','user_listed_count','user_favourites_count','user_verified','user_statuses_count']].copy()

    # Encode the boolean columns as 0/1
    for col in ['is_quote_status', 'user_verified']:
        train_df[col] = train_df[col].astype(int)

    # The slice drops the last column (user_statuses_count), so 8 features are returned
    X = train_df.iloc[:,:-1].values
    return X
    

In [10]:
features = get_features(df)



Perform KNN using the vector obtained from get_features() function. Following are the steps to be followed:
1. Normalise the vectors
2. Split the data into training and test to estimate the performance.
3. Fit the NearestNeighbors module to the training data and obtain the predicted class by getting the nearest neighbours on the test data.
4. Report the accuracy, chosen k-value and method used to obtain the predicted class. Hint: Plot accuracies for a range of k-values. 
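The steps above can be sketched end to end as follows. This is one possible variant, not the required solution. It runs on synthetic data as a stand-in for the tweet vectors, it fits the scaler on the training split only so that no test statistics leak into the normalisation, and it reuses a single neighbour query for every k. All variable names are assumptions.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the feature matrix: two Gaussian blobs, one per class.
X = np.vstack([rng.normal(0, 1, (200, 4)), rng.normal(2, 1, (200, 4))])
y = np.array([0] * 200 + [1] * 200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training split only, then apply it to both splits.
split_scaler = MinMaxScaler().fit(X_tr)
X_tr, X_te = split_scaler.transform(X_tr), split_scaler.transform(X_te)

k_values = list(range(1, 22, 2))    # odd k avoids ties with two classes
nn = NearestNeighbors(n_neighbors=max(k_values)).fit(X_tr)
_, idx = nn.kneighbors(X_te)        # neighbours come back sorted by distance

scores = {}
for k in k_values:
    # Majority vote over the k closest neighbours of each test point.
    y_hat = (y_tr[idx[:, :k]].mean(axis=1) > 0.5).astype(int)
    scores[k] = accuracy_score(y_te, y_hat)

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

Plotting `scores` against `k_values` gives the accuracy curve mentioned in the hint. Choosing k on a separate validation split rather than the test split would give a less optimistic estimate.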

In [11]:
scaler = MinMaxScaler()
scaled = scaler.fit_transform(features)

x = scaled
y = df.iloc[:,-1].values

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state=42)

In [12]:
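k=5
# Index the training vectors and look up the k nearest training points for each test vector
knn = NearestNeighbors(n_neighbors=k, algorithm='ball_tree').fit(X_train)
dist, indices = knn.kneighbors(X_test)
y_pred=[]
for i in range(len(X_test)):
    # Vote counts per class. Class 1 starts at 1, so it also wins 3-3 ties,
    # i.e. it is predicted when at least 2 of the 5 neighbours are popular.
    # Initialise with {0:0, 1:0} for a plain majority vote.
    d={0:0,1:1}
    for j in range(k):
        d[y_train[indices[i][j]]]+=1
    # Class with the highest count (ties go to the larger label)
    Key_max = max(zip(d.values(), d.keys()))[1]
    y_pred.append(Key_max)
y_pred= np.array(y_pred)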

In [13]:
accuracy_score(y_test,y_pred)

0.940990990990991

## Part-2
Vectorize tweets based on the text. More details and reference links can be found in the task list at the start of the notebook.

In [14]:
def tweet_vectoriser(df):
    stop_words = set(nltk.corpus.stopwords.words('english'))
    words = set(nltk.corpus.words.words())

    def clean(text):
        # Drop URLs
        text = re.sub(r'http\S+', '', text)
        # Keep dictionary words and non-alphabetic tokens (numbers, punctuation)
        text = " ".join(w for w in nltk.wordpunct_tokenize(text) if w.lower() in words or not w.isalpha())
        # Remove English stop words
        word_tokens = nltk.tokenize.word_tokenize(text)
        return " ".join(w for w in word_tokens if w.lower() not in stop_words)

    # Clean into a new Series so the caller's DataFrame is not modified
    cleaned = df['text'].apply(clean)

    #TFIDF
    vectoriser = TfidfVectorizer()
    tf = vectoriser.fit_transform(cleaned).toarray()
    #PCA down to 11 components
    pca = PCA(n_components=11)
    tf_pca = pca.fit_transform(tf)

    return tf_pca

Perform KNN using the vector obtained from tweet_vectoriser() function. Following are the steps to be followed:

1. Normalise the vectors
2. Split the data into training and test to estimate the performance.
3. Fit the NearestNeighbors module to the training data and obtain the predicted class by getting the nearest neighbours on the test data.
4. Report the accuracy, chosen k-value and method used to obtain the predicted class. Hint: Plot accuracies for a range of k-values.

In [15]:
text_features = tweet_vectoriser(df)

#Scaling
scaled = scaler.fit_transform(text_features)
x = scaled
y = df.iloc[:,-1].values

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state=42)



In [16]:
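k=5
# Index the training vectors and look up the k nearest training points for each test vector
knn = NearestNeighbors(n_neighbors=k, algorithm='ball_tree').fit(X_train)
dist, indices = knn.kneighbors(X_test)
y_pred=[]
for i in range(len(X_test)):
    # Vote counts per class. Class 1 starts at 1, so it also wins 3-3 ties,
    # i.e. it is predicted when at least 2 of the 5 neighbours are popular.
    # Initialise with {0:0, 1:0} for a plain majority vote.
    d={0:0,1:1}
    for j in range(k):
        d[y_train[indices[i][j]]]+=1
    # Class with the highest count (ties go to the larger label)
    Key_max = max(zip(d.values(), d.keys()))[1]
    y_pred.append(Key_max)
y_pred= np.array(y_pred)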

In [17]:
accuracy_score(y_test,y_pred)

0.8968468468468469

## Part-3
### Subpart-1

Combine both the vectors obtained from the tweet_vectoriser() and get_features()

In [18]:
features = get_features(df)
text_features = tweet_vectoriser(df)


In [19]:
x = np.column_stack([features,text_features])
y = df.iloc[:,-1].values

print(np.shape(x),np.shape(y))

(11099, 19) (11099,)


Perform KNN using the vector obtained in the previous step. Following are the steps to be followed:

1. Normalise the vectors
2. Split the data into training and test to estimate the performance.
3. Fit the NearestNeighbors module to the training data and obtain the predicted class by getting the nearest neighbours on the test data.
4. Report the accuracy, chosen k-value and method used to obtain the predicted class. Hint: Plot accuracies for a range of k-values.

In [20]:
scaled = scaler.fit_transform(x)

x = scaled
y = df.iloc[:,-1].values

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state=42)

In [21]:
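k=5
# Index the training vectors and look up the k nearest training points for each test vector
knn = NearestNeighbors(n_neighbors=k, algorithm='ball_tree').fit(X_train)
dist, indices = knn.kneighbors(X_test)
y_pred=[]
for i in range(len(X_test)):
    # Vote counts per class. Class 1 starts at 1, so it also wins 3-3 ties,
    # i.e. it is predicted when at least 2 of the 5 neighbours are popular.
    # Initialise with {0:0, 1:0} for a plain majority vote.
    d={0:0,1:1}
    for j in range(k):
        d[y_train[indices[i][j]]]+=1
    # Class with the highest count (ties go to the larger label)
    Key_max = max(zip(d.values(), d.keys()))[1]
    y_pred.append(Key_max)
y_pred= np.array(y_pred)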

In [22]:
accuracy_score(y_test,y_pred)

0.9333333333333333

### Subpart-2

Explain the differences between the accuracies obtained in each part above based on the features used.

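Part 1: accuracy is 94.1%. The metadata features are numeric, so they can be scaled to a common range and compared directly by distance, which gives the best predictions of the three.

Part 2: accuracy is 89.7%. The features come from text only, and some information is lost when PCA reduces the TF-IDF vectors to 11 components, so accuracy is somewhat lower.

Part 3: accuracy is 93.3%. The vector combines the numeric metadata with the reduced text features, and this blend performs close to the metadata-only vector.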