# Assignment 1
## Question `1` (K-Nearest Neighbour)

| | |
|-|-|
| Course | Statistical Methods in AI |
| Release Date | `19.01.2023` |
| Due Date | `29.01.2023` |

### Instructions:
1.   The assignment must be implemented in a Python notebook only (Colab, VS Code, Jupyter, etc.).
2.   You are allowed to use libraries for data preprocessing (numpy, pandas, nltk, etc.) and for algorithms as well (sklearn, etc.). You are not, however, allowed to use classifier models directly.
3.   The performance of the model carries weight, but you will also be graded largely on data preprocessing steps, explanations, feature selection for the vectors, etc.
4.   Strict plagiarism checking will be done. An F will be awarded for plagiarism.

### The Dataset
The dataset is to be downloaded from the following drive link ([Link](https://drive.google.com/file/d/1u55iIrTrn41n2lv8HBjtdKLhDcy_6s7O/view?usp=sharing)).
The dataset is a collection of *11099 tweets and has 31 columns*. The data will be in the form of a csv file which you can load in any format. The ground truth, which indicates whether a tweet was popular or not, is available at the following drive link ([Link](https://drive.google.com/file/d/1--cozM5hXDwdbbDaWlB-8NqwSj0nh1Kg/view?usp=sharing)). Since the task involves selecting features yourself to vectorize a tweet, we suggest some data analysis of the columns you consider important.
<br><br>

### The Task
You have to build a classifier which can predict the popularity of a tweet, i.e., whether the tweet was popular or not. You are required to use the **KNN** algorithm to build the classifier and cannot use any inbuilt classifier. All columns should be analyzed, filtered and preprocessed to determine their importance as features in the vector for every tweet (not every column will be useful).<br>
The data contains the **raw text of the tweet** (in the text column) as well as other **meta data** like likes count and user followers count. Note that it might be useful to **create new columns** with useful information. For example, the *number of hashtags* might be useful but is not directly present as a column.<br>
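As a hedged illustration of creating new columns, the sketch below derives a few count features from a single tweet record. It assumes each record carries an `entities` dictionary with `hashtags`, `user_mentions` and `urls` lists, as in the sample record printed later in this notebook. The helper name `derived_counts` and the `n_*` feature names are hypothetical, and the example record is made up.

```python
def derived_counts(tweet):
    """Return simple count-based features for one tweet record (a dict)."""
    entities = tweet.get("entities") or {}
    text = tweet.get("text") or ""
    return {
        "n_hashtags": len(entities.get("hashtags", [])),
        "n_mentions": len(entities.get("user_mentions", [])),
        "n_urls": len(entities.get("urls", [])),
        "n_chars": len(text),
    }


# Toy record shaped like the rows of the dataset (values are invented).
example = {
    "text": "@someone loving the #weather today #sun",
    "entities": {
        "hashtags": [{"text": "weather"}, {"text": "sun"}],
        "user_mentions": [{"screen_name": "someone"}],
        "urls": [],
    },
}
print(derived_counts(example))
```

Whether any of these counts actually helps predict popularity is something the data analysis has to establish.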
There are 3 main sub parts:
1. *Vectorize tweets using only meta data* - likes, user followers count, and other created data
2. *Vectorize tweets using only their text*. This segment will require NLP techniques to clean the text and extract a vector using a BoW model. Here is a useful link for the same - [Tf-Idf](https://towardsdatascience.com/text-vectorization-term-frequency-inverse-document-frequency-tfidf-5a3f9604da6d). Since these vectors will be very large, we recommend reducing their dimensionality (~10 - 25). Hint: [Dimensionality Reduction](https://jonathan-hui.medium.com/machine-learning-singular-value-decomposition-svd-principal-component-analysis-pca-1d45e885e491). Please note that you are allowed to use libraries for this as well.

3. *Combining the vectors from the above two techniques to create one bigger vector*
<br>


Using KNN on these vectors, build a classifier to predict the popularity of the tweet and report accuracies for each of the three methods along with your analysis. You can use sklearn's Nearest Neighbors and need not write KNN from scratch (however, you cannot use the classifier directly). You are expected to try the classifier with different numbers of neighbors and identify the optimal K value.
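One way to stay within these rules is to let sklearn's `NearestNeighbors` do only the neighbour search and to take the majority vote by hand. The sketch below is a minimal example under that reading of the rules, not a reference solution. The function name `knn_predict`, the tie rule and the toy data are assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def knn_predict(train_X, train_y, test_X, k):
    """Predict binary labels by majority vote among the k nearest training points."""
    nn = NearestNeighbors(n_neighbors=k, metric="euclidean")
    nn.fit(train_X)
    _, idx = nn.kneighbors(test_X)        # idx has shape (n_test, k)
    votes = train_y[idx].mean(axis=1)     # fraction of positive neighbours
    return (votes >= 0.5).astype(int)     # a tie predicts class 1 (arbitrary choice)


# Toy data: two well separated clusters, one per class.
rng = np.random.default_rng(0)
train_X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
train_y = np.array([0] * 50 + [1] * 50)
test_X = np.array([[0.0, 0.0], [3.0, 3.0]])
print(knn_predict(train_X, train_y, test_X, k=5))
```

Using an odd `k` avoids ties altogether for a two-class problem.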

## Import necessary libraries

In [43]:
import pandas as pd
import numpy as np
import json
import matplotlib
import math
from tqdm import tqdm

## Load and display the data

In [2]:
data = pd.read_json("tweets.json", lines=True)
gt = np.loadtxt("ground_truth.csv")
print(gt)

print("The columns are as follows:")
for i in data.columns:
    print(i, end=", ")



pd.options.display.max_seq_items = 4000

print("\n\nSample data:")

for i,j in zip(data.keys(), data.values[1]):
    print(i, " : ", j)

print("\n\n")
for x in range(len(data.values)):
    if gt[x] == 1:
        for i,j in zip(data.keys(), data.values[x]):
            print(i, " : ", j)
        break


[0. 0. 1. ... 0. 0. 0.]
The columns are as follows:
created_at, id, id_str, text, truncated, entities, metadata, source, in_reply_to_status_id, in_reply_to_status_id_str, in_reply_to_user_id, in_reply_to_user_id_str, in_reply_to_screen_name, user, geo, coordinates, place, contributors, retweeted_status, is_quote_status, retweet_count, favorite_count, favorited, retweeted, lang, possibly_sensitive, quoted_status_id, quoted_status_id_str, extended_entities, quoted_status, withheld_in_countries, 

Sample data:
created_at  :  2018-07-31 13:34:40+00:00
id  :  1024287229512953856
id_str  :  1024287229512953856
text  :  @hail_ee23 Thanks love its just the feeling of eyes that get me so nervous ❤️
truncated  :  False
entities  :  {'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'hail_ee23', 'name': 'Jordan Vaughn', 'id': 927185727053553665, 'id_str': '927185727053553665', 'indices': [0, 10]}], 'urls': []}
metadata  :  {'iso_language_code': 'en', 'result_type': 'recent'}
source

## Exploratory Data Analysis
*This is an ungraded section but is recommended to get a good grasp on the dataset*

In [None]:
# your code here

## Part-1
*Vectorize tweets using only meta data*

In [118]:
# relevant metadata columns
# relCols = ["id", "fcount", "frndcount", "lcount", "user favcount", "verified", "scount", "rt count", "fav count", "sens"]
relCols = ["id", "fcount",  "frncount", "rt count", "fav count"]

# return data and test splits
def getFeatures(data, percent=0.9):
  """
  Function to return a matrix of dimensions (number of tweets, number of chosen features)
  Input parameters to this function are to be chosen as per requirement (Example: Loaded dataframe of the dataset)

  Candidate metadata columns:
  follower count
  friend count
  listed_count
  favourite count
  verified
  statuses_count
  rt count
  fav count
  sensitive

  Only the columns named in relCols are currently filled in. The first
  `percent` of the rows form the data split and the rest form the test split.
  """

  vecWidth = len(relCols)
  vecHeight = int(len(data.values) * percent)   # for data

  dataFeatureVector = np.zeros((vecHeight, vecWidth), dtype=np.int32)
  testFeatureVector = np.zeros((len(data.values) - vecHeight, vecWidth), dtype=np.int32)

  columns = list(data.columns)


  testCount = 0
  for i, item in enumerate(data.values):    
    feature = np.zeros(vecWidth)

    feature[0] = i                                                    # ID
    feature[1] = item[columns.index("user")]["followers_count"]
    feature[2] = item[columns.index("user")]["friends_count"]
    feature[3] = item[columns.index("retweet_count")]
    feature[4] = item[columns.index("favorite_count")]
    # feature[5] = item[columns.index("user")]["friends_count"]
    # feature[6] = item[columns.index("user")]["verified"]
    # feature[7] = item[columns.index("user")]["statuses_count"]
    # feature[8] = item[columns.index("retweet_count")]
    # feature[9] = item[columns.index("favorite_count")]
    
    if i < vecHeight:
      dataFeatureVector[i] = feature
    else:
      testFeatureVector[testCount] = feature
      testCount += 1

  return dataFeatureVector, testFeatureVector

featureVector, testFeatureVector = getFeatures(data)

print("Metadata only feature vector for the data split: \n", featureVector, featureVector.shape)
print("Metadata only feature vector for the test split: \n", testFeatureVector, testFeatureVector.shape)


Metadata only feature vector for the data split: 
 [[    0   215   335     3     0]
 [    1   199   203     0     0]
 [    2   196   558     5     0]
 ...
 [ 9986   905   882     0     0]
 [ 9987    42    68     0     0]
 [ 9988   275   212 49818     0]] (9989, 5)
Metadata only feature vector for the test split: 
 [[ 9989   842   847  2481     0]
 [ 9990   100   786   131     0]
 [ 9991  8977  7696 15142     0]
 ...
 [11096   135    90     0     0]
 [11097    59   320     3     0]
 [11098  1563  1697     0     0]] (1110, 5)


Perform KNN using the vector obtained from the getFeatures() function. Following are the steps to be followed:
1. Normalise the vectors.
2. Split the data into training and test sets to estimate the performance.
3. Fit the Nearest Neighbours module to the training data and obtain the predicted class by getting the nearest neighbours on the test data.
4. Report the accuracy, the chosen k-value and the method used to obtain the predicted class. Hint: Plot accuracies for a range of k-values.
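The four steps above can be sketched end to end with numpy alone. This is an illustrative outline on synthetic data, not the notebook's own pipeline: it swaps in min-max scaling computed from the training split only, so that both splits share one scale, and the names `min_max_scale` and `knn_accuracy` are hypothetical.

```python
import numpy as np


def min_max_scale(train_X, test_X):
    """Scale both splits to the training range, using training statistics only."""
    lo = train_X.min(axis=0)
    span = train_X.max(axis=0) - lo
    span[span == 0] = 1.0                     # constant columns map to 0
    return (train_X - lo) / span, (test_X - lo) / span


def knn_accuracy(train_X, train_y, test_X, test_y, k):
    """Accuracy of a majority-vote KNN with Euclidean distance."""
    # Pairwise squared distances, shape (n_test, n_train).
    d2 = ((test_X[:, None, :] - train_X[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    pred = (train_y[nearest].mean(axis=1) >= 0.5).astype(int)
    return float((pred == test_y).mean())


# Synthetic features on very different scales; the label depends on column 0 only.
rng = np.random.default_rng(0)
X = rng.random((200, 3)) * np.array([1.0, 1000.0, 1e6])
y = (X[:, 0] > 0.5).astype(int)

split = int(0.9 * len(X))                     # 90/10 split
train_X, test_X = min_max_scale(X[:split], X[split:])
scores = {k: knn_accuracy(train_X, y[:split], test_X, y[split:], k)
          for k in range(1, 20, 2)}
best_k = max(scores, key=scores.get)
print("best k:", best_k, "accuracy:", scores[best_k])
```

The `scores` dictionary can be passed straight to a plotting call to produce the accuracy-versus-k curve that the hint asks for.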

In [119]:
# Part 1 (normalising)

def normaliseFeatureVector(featureVector):
    normFeatureVector = np.zeros_like(featureVector, dtype=np.float32)

    _, nCols = featureVector.shape

    normFeatureVector[:, 0] = featureVector[:, 0] # ID
    for i in range(1, nCols):
        colSum = np.sum(featureVector[:, i])   # avoid shadowing the built-in sum()

        if colSum != 0:
            normFeatureVector[:, i] = featureVector[:, i] / colSum

        else:
            print("Zero sum for column %i" % i)
            normFeatureVector[:, i] = featureVector[:, i]

    return normFeatureVector

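# Part 2 (split)
# using a 90/10 training(data)/test(eval) split
# NOTE: each split is normalised by its own column sums, so the data and
# test vectors end up on different scales (compare the two printed outputs)

dataFeatureVector, testFeatureVector = getFeatures(data, percent=0.9)
nDataFeatureVector = normaliseFeatureVector(dataFeatureVector)
nTestFeatureVector = normaliseFeatureVector(testFeatureVector)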

print("Normalised data feature vector:\n", nDataFeatureVector, "\n")
print("Normalised test feature vector:\n", nTestFeatureVector, "\n")


Normalised data feature vector:
 [[0.00000000e+00 3.39048347e-06 2.28933823e-05 1.08625315e-07
  0.00000000e+00]
 [1.00000000e+00 3.13816849e-06 1.38727064e-05 0.00000000e+00
  0.00000000e+00]
 [2.00000000e+00 3.09085931e-06 3.81328573e-05 1.81042196e-07
  0.00000000e+00]
 ...
 [9.98600000e+03 1.42715699e-05 6.02745167e-05 0.00000000e+00
  0.00000000e+00]
 [9.98700000e+03 6.62327011e-07 4.64701498e-06 0.00000000e+00
  0.00000000e+00]
 [9.98800000e+03 4.33666492e-06 1.44877522e-05 1.80383201e-03
  0.00000000e+00]] 

Normalised test feature vector:
 [[9.9890000e+03 2.5647532e-04 6.1580248e-04 7.7177514e-04 0.0000000e+00]
 [9.9900000e+03 3.0460251e-05 5.7145307e-04 4.0750721e-05 0.0000000e+00]
 [9.9910000e+03 2.7344169e-03 5.5952962e-03 4.7102859e-03 0.0000000e+00]
 ...
 [1.1096000e+04 4.1121340e-05 6.5433560e-05 0.0000000e+00 0.0000000e+00]
 [1.1097000e+04 1.7971548e-05 2.3265266e-04 9.3322262e-07 0.0000000e+00]
 [1.1098000e+04 4.7609373e-04 1.2337861e-03 0.0000000e+00 0.0000000e+00]] 



In [121]:
# Part 3 (fitting and determining k)

def calcL2Norm(a, b):
    # Euclidean distance between two feature vectors, skipping index 0 (the ID)
    if len(a) != len(b):
        raise ValueError("Incorrect dim")

    total = 0
    for x in range(1,len(a)):
        total += (a[x]-b[x]) * (a[x]-b[x])

    return math.sqrt(total)


def kNN(k, nDataFeatureVector, nTestFeatureVector):
    correctPredictions = 0
    correctPositive = 0
    totalPredictions = len(nTestFeatureVector)
    totalPositives = 0

    # get the features with the k smallest costs
    for testEntry in tqdm(nTestFeatureVector):
        # squared differences to every training row; column 0 (the ID) is excluded
        allCosts = np.square(nDataFeatureVector - testEntry)
        distances = np.sqrt(np.sum(allCosts[:, 1:], axis=1))

        # indices of the k nearest training rows; the data split is the first
        # block of the dataset, so a row index is also an index into gt
        closest = np.argsort(distances)[:k]

        # majority vote over the neighbours' labels (a tie predicts class 1)
        tot = np.sum(gt[closest])
        if float(tot)/k < 0.5:
            predictedClass = 0
        else:
            predictedClass = 1

        actualClass = gt[int(testEntry[0])]   # column 0 holds the tweet's row index
        if actualClass == predictedClass:
            correctPredictions += 1
            if predictedClass == 1:
                correctPositive += 1

        if actualClass == 1:
            totalPositives += 1

    accuracy = float(correctPredictions)/totalPredictions
    # share of the actual positives that were predicted correctly (recall)
    positiveAccuracy = float(correctPositive)/totalPositives if totalPositives else 0.0
    return accuracy, positiveAccuracy, correctPositive, totalPositives

# loop over k = 1 .. maxK-1, calculate accuracy and the fraction of actual positives predicted correctly
maxK = 20
for k in range(1, maxK):
    accuracy, positiveAccuracy, correctPositive, tp = kNN(k, nDataFeatureVector, nTestFeatureVector)
    print("K = ", k, " | Accuracy = ", accuracy, " | Correctly predicted positives = ", positiveAccuracy, " | kp = ", correctPositive, " out of ", tp)
        


SyntaxError: invalid syntax (2045060831.py, line 72)

In [48]:
# fraction of tweets labelled popular (class 1)
b = np.sum(gt)
print(b/len(gt))

0.07748445805928462
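The output above shows that roughly 7.7% of the tweets are labelled popular, so a rule that always predicts "not popular" would already be about 92% accurate on the full dataset. Accuracy alone is therefore a weak signal here. The sketch below, with the hypothetical helper `binary_report` and made-up labels, reports precision and recall next to accuracy for such a trivial predictor.

```python
import numpy as np


def binary_report(y_true, y_pred):
    """Accuracy, precision and recall for binary labels (1 = popular)."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    accuracy = float(np.mean(y_true == y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Invented labels with 8% positives and a predictor that never says "popular".
y_true = np.array([0] * 92 + [1] * 8)
always_negative = np.zeros_like(y_true)
print(binary_report(y_true, always_negative))
```

A KNN model is only doing useful work on this dataset if its recall on the popular class is above zero while accuracy stays near or above that trivial baseline.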


## Part-2
Vectorize tweets based on the text. More details and reference links can be found in the task list at the start of the notebook.
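A possible shape for the text pipeline is sketched here: clean each tweet with regular expressions, build TF-IDF vectors with sklearn's `TfidfVectorizer`, then reduce them with `TruncatedSVD`. The function names `clean_tweet` and `vectorise`, the regular expressions, the English stop-word list and the toy tweets are all assumptions chosen for illustration. The target of 10 to 25 dimensions comes from the task description.

```python
import re

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer


def clean_tweet(text):
    """Strip links, @mentions and #hashtags, then lowercase and tidy whitespace."""
    text = re.sub(r"https?://\S+", " ", text)   # links
    text = re.sub(r"[@#]\w+", " ", text)        # mentions and hashtags
    return re.sub(r"\s+", " ", text).strip().lower()


def vectorise(texts, n_components=10):
    """Return a dense (n_tweets, n_components) matrix of reduced TF-IDF vectors."""
    cleaned = [clean_tweet(t) for t in texts]
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(cleaned)
    # Keep n_components below the vocabulary size so the SVD is well defined.
    n_components = min(n_components, tfidf.shape[1] - 1)
    svd = TruncatedSVD(n_components=n_components, random_state=0)
    return svd.fit_transform(tfidf)


# Invented tweets, only to show the shapes involved.
texts = [
    "Great game tonight! #win https://t.co/abc",
    "@friend thanks for the lovely dinner",
    "Rain again, the traffic is terrible",
    "Lovely weather for a game of football",
]
vectors = vectorise(texts, n_components=3)
print(vectors.shape)
```

On the real dataset the vocabulary is far larger than in this toy example, so `n_components` can be set anywhere in the suggested range without hitting the cap.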

In [None]:
def tweet_vectoriser():
  """
  Function to return a matrix of dimensions (number of tweets, number of features extracted per tweet)
  Following are the steps to be followed:
    1. Remove links, tags and hashtags from each tweet.
    2. Apply TF-IDF on the tweets to extract a vector.
    3. Perform dimensionality reduction on the obtained vector.
  Input parameters to this function are to be chosen as per requirement (Example: Array of tweets)
  """
  # your code here

Perform KNN using the vector obtained from tweet_vectoriser() function. Following are the steps to be followed:

1. Normalise the vectors
2. Split the data into training and test to estimate the performance.
3. Fit the Nearest Neighbours module to the training data and obtain the predicted class by getting the nearest neighbours on the test data.
4. Report the accuracy, chosen k-value and method used to obtain the predicted class. Hint: Plot accuracies for a range of k-values.

In [None]:
# your code here

## Part-3
### Subpart-1

Combine both the vectors obtained from tweet_vectoriser() and getFeatures().
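Combining the two representations amounts to concatenating, row by row, the metadata vector and the text vector of the same tweet. The sketch below is one hedged way to do that; it rescales every column by its largest absolute value first so that raw counts do not swamp the small text components. The helper name `combine`, the scaling choice and the example matrices are assumptions.

```python
import numpy as np


def combine(meta_vectors, text_vectors):
    """Scale each column to at most 1 in absolute value, then join the matrices column-wise."""
    if len(meta_vectors) != len(text_vectors):
        raise ValueError("both matrices need one row per tweet")

    def scale(matrix):
        matrix = np.asarray(matrix, dtype=float)
        peak = np.abs(matrix).max(axis=0)
        peak[peak == 0] = 1.0                 # leave all-zero columns unchanged
        return matrix / peak

    return np.hstack([scale(meta_vectors), scale(text_vectors)])


# Invented inputs: 3 tweets, 2 metadata features and 3 text components each.
meta = np.array([[215, 335], [199, 203], [196, 558]])
text = np.array([[0.1, -0.2, 0.0], [0.4, 0.1, 0.0], [-0.3, 0.2, 0.0]])
combined = combine(meta, text)
print(combined.shape)
```

Both inputs must list the tweets in the same order, otherwise the concatenated rows mix features from different tweets.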

In [None]:
# your code here

Perform KNN using the vector obtained in the previous step. Following are the steps to be followed:

1. Normalise the vectors
2. Split the data into training and test to estimate the performance.
3. Fit the Nearest Neighbours module to the training data and obtain the predicted class by getting the nearest neighbours on the test data.
4. Report the accuracy, chosen k-value and method used to obtain the predicted class. Hint: Plot accuracies for a range of k-values.

In [None]:
# your code here

### Subpart-2

Explain the differences between the accuracies obtained in each part above based on the features used.