# Assignment 1
## Question `1` (K-Nearest Neighbour)

| | |
|-|-|
| Course | Statistical Methods in AI |
| Release Date | `19.01.2023` |
| Due Date | `29.01.2023` |

### Instructions:
1.   Assignment must be implemented using python notebook only (Colab , VsCode , Jupyter etc.)
2.   You are allowed to use libraries for data preprocessing (numpy, pandas, nltk etc) and for algorithms as well (sklearn etc). You are not however allowed to directly use classifier models.
3.   The performance of the model will hold weightage but you will also be graded largely for data preprocessing steps , explanations , feature selection for vectors etc.
4.   Strict plagiarism checking will be done. An F will be awarded for plagiarism.

### The Dataset
The dataset is to be downloaded from the following drive link ([Link](https://drive.google.com/file/d/1u55iIrTrn41n2lv8HBjtdKLhDcy_6s7O/view?usp=sharing)).
The dataset is a collection of *11099 tweets and has 31 columns*. The data will be in the form of a csv file which you can load in any format. The ground truth is available in the following drive link ([Link](https://drive.google.com/file/d/1--cozM5hXDwdbbDaWlB-8NqwSj0nh1Kg/view?usp=sharing)) which corresponds to whether a tweet was popular or not. Since the task involves selecting features yourself to vectorize a tweet , we suggest some data analysis of the columns you consider important.
<br><br>

### The Task
You have to build a classifier which can predict the popularity of the tweet, i.e , if the tweet was popular or not. You are required to use **KNN** algorithm to build the classifier and cannot use any inbuilt classifier. All columns are supposed to be analyzed , filtered and preprocessed to determine its importance as a feature in the vector for every tweet (Not every column will be useful).<br>
The Data contains the **raw text of the tweet**(in the text column) as well as other **meta data** like likes count , user followers count. Note that it might be useful to **create new columns** with useful information. For example, *number of hashtags* might be useful but is not directly present as a column.<br>
There are 3 main sub parts:
1. *Vectorize tweets using only meta data* - likes , user followers count , and other created data
2. *Vectorize tweets using only it's text*. This segment will require NLP techniques to clean the text and extract a vector using a BoW model. Here is a useful link for the same - [Tf-Idf](https://towardsdatascience.com/text-vectorization-term-frequency-inverse-document-frequency-tfidf-5a3f9604da6d). Since these vectors will be very large , we recommend reducing their dimensionality (~10 - 25). Hint: [Dimentionality Reduction](https://jonathan-hui.medium.com/machine-learning-singular-value-decomposition-svd-principal-component-analysis-pca-1d45e885e491). Please note that for this also you are allowed to use libraries.

3. *Combining the vectors from above two techinques to create one bigger vector*
<br>


Using KNN on these vectors build a classifier to predict the popularity of the tweet and report accuracies on each of the three methods as well as analysis. You can use sklearn's Nearest Neighbors and need not write KNN from scratch. (However you cannot use the classifier directly). You are expected to try the classifier for different number of neighbors and identify the optimal K value.

## Import necessary libraries

In [1]:
import pandas as pd
import numpy as np
import json
import matplotlib
import math
from tqdm import tqdm
import re
import string
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

## Load and display the data

In [2]:
data = pd.read_json("tweets.json", lines=True)
gt = np.loadtxt("ground_truth.csv")
print(gt)

print("The columns are as follows:")
for i in data.columns:
    print(i, end=", ")

data.insert(0, "itemID", range(0, len(data)))


pd.options.display.max_seq_items = 4000

print("\n\nSample data:")

a = 0
for i,j in zip(data.keys(), data.values[1]):
    print(i, " : ", j)

print("\n\n")
for x in range(len(data.values)):
    if gt[x] == 1:
        for i,j in zip(data.keys(), data.values[x]):
            print(i, " : ", j)
        break


[0. 0. 1. ... 0. 0. 0.]
The columns are as follows:
created_at, id, id_str, text, truncated, entities, metadata, source, in_reply_to_status_id, in_reply_to_status_id_str, in_reply_to_user_id, in_reply_to_user_id_str, in_reply_to_screen_name, user, geo, coordinates, place, contributors, retweeted_status, is_quote_status, retweet_count, favorite_count, favorited, retweeted, lang, possibly_sensitive, quoted_status_id, quoted_status_id_str, extended_entities, quoted_status, withheld_in_countries, 

Sample data:
itemID  :  1
created_at  :  2018-07-31 13:34:40+00:00
id  :  1024287229512953856
id_str  :  1024287229512953856
text  :  @hail_ee23 Thanks love its just the feeling of eyes that get me so nervous ❤️
truncated  :  False
entities  :  {'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'hail_ee23', 'name': 'Jordan Vaughn', 'id': 927185727053553665, 'id_str': '927185727053553665', 'indices': [0, 10]}], 'urls': []}
metadata  :  {'iso_language_code': 'en', 'result_type': 're

## Exploratory Data Analysis
*This is an ungraded section but is recommended to get a good grasp on the dataset*

In [3]:
# your code here

## Part-1
*Vectorize tweets using only meta data*

In [32]:
# relevant metadata columns
# relCols = ["id", "fcount", "frndcount", "lcount", "user favcount", "verified", "scount", "rt count", "fav count", "sens"]
relCols = ["id", "fcount"]#,  "rt count"] #,  "frncount", "fav count", ]

# return data and test splits
def getFeatures(data, percent=0.9):
  """
  Funtion to return a matrix of dimensions (number of tweets, number of chosen features)
  Input parameters to this funcion are to be chosen as per requirement (Example: Loaded dataframe of the dataset) 
  """

  """
  relevant columns:
  follower count
  friend count
  listed_count
  favourite count
  verified
  statuses_count
  rt count
  fac count
  sensitive
  """

  vecWidth = len(relCols)
  vecHeight = int(len(data.values))   # for data

  dataFeatureVector = np.zeros((vecHeight, vecWidth), dtype=np.int32)

  columns = list(data.columns)


  testCount = 0
  for i, item in enumerate(data.values):    
    feature = np.zeros(vecWidth)

    feature[0] = item[columns.index("itemID")]                                                    # ID
    feature[1] = item[columns.index("retweet_count")]
    # feature[1] = item[columns.index("user")]["followers_count"]
    # feature[3] = item[columns.index("user")]["friends_count"]
    # feature[4] = item[columns.index("favorite_count")]
    # feature[5] = item[columns.index("user")]["friends_count"]
    # feature[6] = item[columns.index("user")]["verified"]
    # feature[7] = item[columns.index("user")]["statuses_count"]
    # feature[8] = item[columns.index("retweet_count")]
    # feature[9] = item[columns.index("favorite_count")]
    
    dataFeatureVector[i] = feature

  return dataFeatureVector

data = data.sample(frac=1)

fullVector = getFeatures(data)
featureVector = fullVector[:int(0.9*fullVector.shape[0])]
testFeatureVector = fullVector[int(0.9*fullVector.shape[0]):]

print("Metadata only feature vector for the data split: \n", featureVector, featureVector.shape)
print("Metadata only feature vector for the test split: \n", testFeatureVector, testFeatureVector.shape)


Metadata only feature vector for the data split: 
 [[ 9503  1606     0]
 [ 5278  1604     0]
 [10478    86     0]
 ...
 [ 5594   776 12564]
 [10115  2164     0]
 [ 6761   212     0]] (9989, 3)
Metadata only feature vector for the test split: 
 [[ 7232  1105  1025]
 [ 1209   226     0]
 [ 3464    48     0]
 ...
 [ 5796   103   346]
 [ 2072   222     0]
 [10349    14     0]] (1110, 3)


Perform KNN using the vector obtained from get_features() function. Following are the steps to be followed:
1. Normalise the vectors
2. Split the data into training and test to estimate the performance.
3. Fit the Nearest Neughbiurs module to the training data and obtain the predicted class by getting the nearest neighbours on the test data.
4. Report the accuracy, chosen k-value and method used to obtain the predicted class. Hint: Plot accuracies for a range of k-values. 

In [33]:
# Part 1 (normalising)

def normaliseFeatureVector(featureVector):
    normFeatureVector = np.zeros_like(featureVector, dtype=np.float32)

    _, nCols = featureVector.shape

    normFeatureVector[:, 0] = featureVector[:, 0] # ID
    for i in range(1, nCols):
        sum = np.sum(featureVector[:, i])

        if sum != 0:
            normFeatureVector[:, i] = featureVector[:, i] / sum

        else:
            print("Zero sum for column %i" % i)
            normFeatureVector[:, i] = featureVector[:, i]

    return normFeatureVector

# Part 2 (split)
# using a 90/10 training(data)/test(eval) split

fullVector = getFeatures(data, percent=0.7)
dataFeatureVector = fullVector[:int(0.9*fullVector.shape[0])]
testFeatureVector = fullVector[int(0.9*fullVector.shape[0]):]

nDataFeatureVector = normaliseFeatureVector(dataFeatureVector)
nTestFeatureVector = normaliseFeatureVector(testFeatureVector)

print("Normalised data feature vector:\n", nDataFeatureVector, "\n")
print("Normalised test feature vector:\n", nTestFeatureVector, "\n")


Normalised data feature vector:
 [[9.5030000e+03 3.6767946e-05 0.0000000e+00]
 [5.2780000e+03 3.6722158e-05 0.0000000e+00]
 [1.0478000e+04 1.9688937e-06 0.0000000e+00]
 ...
 [5.5940000e+03 1.7765833e-05 4.5195647e-04]
 [1.0115000e+04 4.9542861e-05 0.0000000e+00]
 [6.7610000e+03 4.8535521e-06 0.0000000e+00]] 

Normalised test feature vector:
 [[7.2320000e+03 4.8009239e-05 3.3790455e-04]
 [1.2090000e+03 9.8190840e-06 0.0000000e+00]
 [3.4640000e+03 2.0854693e-06 0.0000000e+00]
 ...
 [5.7960000e+03 4.4750695e-06 1.1406339e-04]
 [2.0720000e+03 9.6452950e-06 0.0000000e+00]
 [1.0349000e+04 6.0826187e-07 0.0000000e+00]] 



In [34]:
# Part 3 (fitting and determining k)

def calcL2Norm(a, b):
    if len(a) != len(b):
        print("Incorrect dim")
        exit(1)
    
    sum = 0
    for x in range(1,len(a)):
        sum += (a[x]-b[x]) * (a[x]-b[x])
    
    return math.sqrt(sum)


def kNN(k, nDataFeatureVector, nTestFeatureVector):
    correctPredictions = 0
    correctPositive = 0
    totalPredictions = len(nTestFeatureVector)
    totalPositives = 0

    # get the features with the k smallest costs
    for testEntry in tqdm(nTestFeatureVector):
        allCosts = nDataFeatureVector - testEntry
        allCosts = np.square(allCosts)
        allCosts = np.sum(allCosts[:, 1:], axis=1)        

        sortedCosts = np.zeros((allCosts.shape[0], 2))
        sortedCosts[:,0] = np.sqrt(allCosts)
        sortedCosts[:,1] = np.arange(len(nDataFeatureVector))                

        ind = np.argsort(sortedCosts[:,0])
        sortedCosts = sortedCosts[ind]

        closest = sortedCosts[:k]
        # check nbd classes
        tot = 0.0
        predictedClass = 0

        bleep = 0
        for i in closest:
            # print(i.shape)
            # print(i)
            # print(gt[int(i[1])])
            if gt[int(i[1])] == 1.0:
                bleep += 1
            tot += gt[int(i[1])]
        
        # if bleep > 0:
            # print(bleep, testEntry[0])

        if float(tot)/k < 0.5:
            predictedClass = 0
        else:
            predictedClass = 1

        if gt[int(testEntry[0])] == predictedClass:
            correctPredictions += 1
            if predictedClass == 1:
                correctPositive += 1

        if gt[int(testEntry[0])] == 1:
            totalPositives += 1
    
    accuracy = float(correctPredictions)/totalPredictions
    positiveAccuracy = float(correctPositive)/totalPositives
    return accuracy, positiveAccuracy, correctPositive, totalPositives

# loop over all ks, calculate accuracy and percent of correctly predicted positives
maxK = 10
for k in range(1, maxK):
    accuracy, positiveAccuracy, correctPositive, tp = kNN(k, nDataFeatureVector, nTestFeatureVector)
    print("K = ", k, " | Accuracy = ", accuracy, " | Correctly predicted positives = ", positiveAccuracy, " | kp = ", correctPositive, " out of ", tp)
        


100%|██████████| 1110/1110 [00:00<00:00, 1264.54it/s]


K =  1  | Accuracy =  0.8801801801801802  | Correctly predicted positives =  0.05128205128205128  | kp =  4  out of  78


100%|██████████| 1110/1110 [00:00<00:00, 1249.61it/s]


K =  2  | Accuracy =  0.8216216216216217  | Correctly predicted positives =  0.23076923076923078  | kp =  18  out of  78


100%|██████████| 1110/1110 [00:00<00:00, 1287.08it/s]


K =  3  | Accuracy =  0.9117117117117117  | Correctly predicted positives =  0.01282051282051282  | kp =  1  out of  78


100%|██████████| 1110/1110 [00:00<00:00, 1308.90it/s]


K =  4  | Accuracy =  0.8918918918918919  | Correctly predicted positives =  0.02564102564102564  | kp =  2  out of  78


100%|██████████| 1110/1110 [00:00<00:00, 1305.22it/s]


K =  5  | Accuracy =  0.9252252252252252  | Correctly predicted positives =  0.0  | kp =  0  out of  78


100%|██████████| 1110/1110 [00:00<00:00, 1309.85it/s]


K =  6  | Accuracy =  0.9171171171171171  | Correctly predicted positives =  0.0  | kp =  0  out of  78


100%|██████████| 1110/1110 [00:00<00:00, 1314.85it/s]


K =  7  | Accuracy =  0.9288288288288288  | Correctly predicted positives =  0.0  | kp =  0  out of  78


100%|██████████| 1110/1110 [00:00<00:00, 1318.48it/s]


K =  8  | Accuracy =  0.927027027027027  | Correctly predicted positives =  0.0  | kp =  0  out of  78


100%|██████████| 1110/1110 [00:00<00:00, 1272.22it/s]

K =  9  | Accuracy =  0.9297297297297298  | Correctly predicted positives =  0.0  | kp =  0  out of  78





For the chosen metadata fields, the optimal value of k seems to be 2.
The actual accuracy is not a good metric due to how heavily skewed the data is to not popular tweets. Instead, the number of correctly predicted positives is included as a metric.
Past a certain value of k, the number of non popular tweets in the nearest neighbours greatly outweighs the closeby positive tweets, thus no tweet is ever predicted to be popular.

## Part-2
Vectorize tweets based on the text. More details and reference links can be checked on the Tasks list in the start of the notebook

In [16]:

def tweet_vectoriser(data, percent=0.9):
  """
  Funtion to return a matrix of dimensions (number of tweets, number of features extracted per tweet)
  Following are the steps for be followed:
    1. Remove links, tags and hashtags from each tweet.
    2. Apply TF-IDF on the tweets to extract a vector. 
    3. Perform dimensionality reduction on the obtained vector. 
  Input parameters to this funcion are to be chosen as per requirement (Example: Array of tweets) 
  """
  
  # clean text
  cleanText = []
  columns = list(data.columns)

  for i in data.values:
    text = i[columns.index("text")]

    removalRegex = "@[a-z0-9]*|[?!\.\*]*| [a-z0-9]*…$|^rt|[ ]{2,}|https:\/\/.*|\&amp|\\n"
    leaveNothingBehind = "[^a-zA-Z\- #\n]"
    removeUncommonStops = " im | got | just | ive | th | hes | shes | its | dont | do |   *"

    text = text.lower()
    text = re.sub(removalRegex , '', text)
    text = re.sub(leaveNothingBehind , '', text)
    text = re.sub(removeUncommonStops , '', text)

    cleanText.append([i[0], text])
  
  # tf-idf
  cleanTextDF = pd.DataFrame(cleanText, columns=["itemID", "text"])
  vectorizer = CountVectorizer(stop_words='english') 
  counts = vectorizer.fit_transform(cleanTextDF.loc[:, "text"]) 
  countsDf = pd.DataFrame(counts.A, columns=vectorizer.get_feature_names_out())

  # drop uncommon words, get most important ones sorted
  allWords = list(countsDf.columns)
  recurringImportantWords = []
  
  # remove infrequent words
  for i, item in enumerate(countsDf.sum()):
    if(item >= 10):
      recurringImportantWords.append([allWords[i], item])

  recurringImportantWords = sorted(recurringImportantWords, key=lambda x: -x[1])
            # print(len(recurringImportantWords), recurringImportantWords)
            # print(np.array(recurringImportantWords[1]))

  # create TF-IDF vectors for each tweet
  tfidfVector = np.zeros((len(data.values), len(recurringImportantWords)), dtype=np.float32)
  indexList = []
  print(tfidfVector.shape)

  N = len(cleanTextDF.values)
  idf = np.log(N * np.reciprocal(np.array(np.array(recurringImportantWords)[:,1], dtype=np.float32)))

  onlyWords = np.array(recurringImportantWords)[:,0]
  print("Important words: ", onlyWords)


  for i, value in enumerate(cleanTextDF.values):
    # print(i,end="\r")
    words = value[1]
    indexList.append(value[0])

    # get per sentence word counts
    for word in words.split(" "):
      present = np.where(onlyWords == word)

      if len(present[0]) == 0:
        continue
      else:
        tfidfVector[i][int(present[0])] += 1


    print("Sentence: ", i, end="\r")
    tfidfVector[i] = np.log(1 + tfidfVector[i])
    tfidfVector[i] = tfidfVector[i] * idf

  # dimensionality reduction

  # normalise tfids first
  # normalise
  for i in range(1, tfidfVector.shape[1]):
    avg = np.mean(tfidfVector[:,i])
    tfidfVector[:,i] -= avg

    bigness = np.max(tfidfVector[:,i]) - np.min(tfidfVector[:,i])

    if bigness != 0:
      tfidfVector[:,i] /= bigness

  print("Performing SVD...")
  u, s, v = np.linalg.svd(tfidfVector)
  numDimensions = 25

  print("SVD complete")

  reducedTfidVector = np.empty((u.shape[0], numDimensions + 1))
  reducedTfidVector[:,0] = indexList
  reducedTfidVector[:,1:] = u[:,:numDimensions]

  return reducedTfidVector
   


# relevant metadata columns
# relCols = ["id", "fcount", "frndcount", "lcount", "user favcount", "verified", "scount", "rt count", "fav count", "sens"]
relCols = ["id", "fcount",  "frncount", "rt count", "fav count", ]

# return data and test splits
def getFeatures(data, percent=0.9):
  """
  Funtion to return a matrix of dimensions (number of tweets, number of chosen features)
  Input parameters to this funcion are to be chosen as per requirement (Example: Loaded dataframe of the dataset) 
  """

  """
  relevant columns:
  follower count
  friend count
  listed_count
  favourite count
  verified
  statuses_count
  rt count
  fac count
  sensitive
  """

  vecWidth = len(relCols)
  vecHeight = int(len(data.values) * percent)   # for data

  dataFeatureVector = np.zeros((vecHeight, vecWidth), dtype=np.int32)
  testFeatureVector = np.zeros((len(data.values) - vecHeight, vecWidth), dtype=np.int32)

  columns = list(data.columns)


  testCount = 0
  for i, item in enumerate(data.values):    
    feature = np.zeros(vecWidth)

    feature[0] = item[columns.index("itemID")]                                                    # ID
    feature[1] = item[columns.index("user")]["followers_count"]
    feature[2] = item[columns.index("user")]["friends_count"]
    feature[3] = item[columns.index("retweet_count")]
    feature[4] = item[columns.index("favorite_count")]
    # feature[5] = item[columns.index("user")]["friends_count"]
    # feature[6] = item[columns.index("user")]["verified"]
    # feature[7] = item[columns.index("user")]["statuses_count"]
    # feature[8] = item[columns.index("retweet_count")]
    # feature[9] = item[columns.index("favorite_count")]
    
    if i < vecHeight:
      dataFeatureVector[i] = feature
    else:
      testFeatureVector[testCount] = feature
      testCount += 1

  return dataFeatureVector, testFeatureVector

data = data.sample(frac=1)

reducedTfidfVector= tweet_vectoriser(data)




(11099, 1710)
Important words:  ['people' 'new' 'like' ... 'winners' 'writer' 'zimbabwe']
Performing SVD...
SVD complete


Perform KNN using the vector obtained from tweet_vectoriser() function. Following are the steps to be followed:

1. Normalise the vectors
2. Split the data into training and test to estimate the performance.
3. Fit the Nearest Neughbiurs module to the training data and obtain the predicted class by getting the nearest neighbours on the test data.
4. Report the accuracy, chosen k-value and method used to obtain the predicted class. Hint: Plot accuracies for a range of k-values.

In [17]:
# normalise
for i in range(1, reducedTfidfVector.shape[1]):
  avg = np.mean(reducedTfidfVector[:,i])
  reducedTfidfVector[:,i] -= avg

  bigness = np.max(reducedTfidfVector[:,i]) - np.min(reducedTfidfVector[:,i])
  reducedTfidfVector[:,i] /= bigness

print("Reduced tfid vecotr: ", reducedTfidfVector)

pd.options.display.max_colwidth = 1000

Reduced tfid vecotr:  [[ 3.77900000e+03  1.84468802e-02  5.89117135e-03 ... -1.06675952e-02
  -1.39467266e-02 -8.27888883e-02]
 [ 1.03680000e+04  1.81541230e-02  7.16250849e-03 ...  1.37821968e-01
  -1.38169090e-02  1.35356766e-01]
 [ 4.17600000e+03  1.83697598e-02 -9.36803597e-02 ... -4.88249747e-03
  -2.73822095e-02  4.12887626e-02]
 ...
 [ 3.65800000e+03  1.83151175e-02  4.75044741e-03 ... -1.31452876e-04
  -1.64250366e-03 -2.04305443e-02]
 [ 1.10000000e+02  1.90025041e-02  7.04367740e-03 ...  8.11255765e-04
   3.83277688e-03 -2.89465572e-02]
 [ 8.08900000e+03  1.83351230e-02  6.09017296e-03 ... -2.49387832e-03
  -1.01815991e-03 -1.81536116e-02]]


In [19]:
# performing a k-nn search
# data was shuffled initially, no need to shuffle again

dataTfidf = reducedTfidfVector[ :int(0.90 * reducedTfidfVector.shape[0])  , :]
testTfidf = reducedTfidfVector[  int(0.90 * reducedTfidfVector.shape[0]): , :]

def textKNN(k, dataTfidf, testTfidf):
    correctPredictions = 0
    correctPositive = 0
    totalPredictions = len(nTestFeatureVector)
    totalPositives = 0
    predictedPositives = 0
    predictedNegatives = 0

    pptp = 0
    pntp = 0
    pptn = 0
    pntn = 0

    # get the features with the k smallest costs
    for testEntry in tqdm(testTfidf):
        allCosts = dataTfidf - testEntry
        allCosts = np.square(allCosts)
        allCosts = np.sum(allCosts[:, 1:], axis=1)          # 1st column is the ID

        sortedCosts = np.zeros((allCosts.shape[0], 2))
        sortedCosts[:,0] = np.sqrt(allCosts)
        sortedCosts[:,1] = dataTfidf[:,0]

        ind = np.argsort(sortedCosts[:,0])
        sortedCosts = sortedCosts[ind]

        closest = sortedCosts[:k,:]
        # check nbd classes
        tot = 0.0
        predictedClass = 0

        bleep = 0
        for i in closest:
            # print(i.shape)
            # print(i)
            # print(gt[int(i[1])])
            if gt[int(i[1])] == 1.0:
                bleep += 1
            tot += gt[int(i[1])]
        
        # if bleep > 0:
            # print(bleep, testEntry[0])

        if float(tot)/k < 0.5:
            predictedClass = 0
        else:
            predictedClass = 1

        if gt[int(testEntry[0])] == predictedClass:
            correctPredictions += 1
            if predictedClass == 1:
                correctPositive += 1

        if gt[int(testEntry[0])] == 1:
            totalPositives += 1

        if predictedClass == 1:
            predictedPositives += 1
            if gt[int(testEntry[0])] == 0:
                pptn += 1
            else:
                pptp += 1

        else:
            predictedNegatives += 1
            if gt[int(testEntry[0])] == 0:
                pntn += 1
            else:
                pntp += 1
    
    

    print("pptp: ", pptp, " | pptn: ", pptn, " | pntp: ", pntp, " | pntn: ", pntn)
    accuracy = float(pptp + pntn)/(pptp + pptn + pntp + pntn)
    positiveAccuracy = float(pptp)/(pptp+pntp)
    return accuracy, positiveAccuracy, correctPositive, totalPositives, predictedPositives, predictedNegatives

# loop over all ks, calculate accuracy and percent of correctly predicted positives
maxK = 10
for k in range(1, 10):
    accuracy, positiveAccuracy, correctPositive, tp, pp, pn = textKNN(k, dataTfidf, testTfidf)
    print("K = ", k, " | Accuracy = ", accuracy, " | Correctly predicted positives % = ", positiveAccuracy, " | correct positive number = ", correctPositive, " out of ", tp, " | pp = ", pp, " | pn = ", pn)


100%|██████████| 1110/1110 [00:01<00:00, 788.73it/s]


pptp:  6  | pptn:  72  | pntp:  85  | pntn:  947
K =  1  | Accuracy =  0.8585585585585586  | Correctly predicted positives % =  0.06593406593406594  | correct positive number =  6  out of  91  | pp =  78  | pn =  1032


100%|██████████| 1110/1110 [00:01<00:00, 811.78it/s]


pptp:  15  | pptn:  129  | pntp:  76  | pntn:  890
K =  2  | Accuracy =  0.8153153153153153  | Correctly predicted positives % =  0.16483516483516483  | correct positive number =  15  out of  91  | pp =  144  | pn =  966


100%|██████████| 1110/1110 [00:01<00:00, 854.53it/s]


pptp:  1  | pptn:  18  | pntp:  90  | pntn:  1001
K =  3  | Accuracy =  0.9027027027027027  | Correctly predicted positives % =  0.01098901098901099  | correct positive number =  1  out of  91  | pp =  19  | pn =  1091


100%|██████████| 1110/1110 [00:01<00:00, 845.14it/s]


pptp:  1  | pptn:  37  | pntp:  90  | pntn:  982
K =  4  | Accuracy =  0.8855855855855855  | Correctly predicted positives % =  0.01098901098901099  | correct positive number =  1  out of  91  | pp =  38  | pn =  1072


100%|██████████| 1110/1110 [00:01<00:00, 856.37it/s]


pptp:  0  | pptn:  6  | pntp:  91  | pntn:  1013
K =  5  | Accuracy =  0.9126126126126126  | Correctly predicted positives % =  0.0  | correct positive number =  0  out of  91  | pp =  6  | pn =  1104


100%|██████████| 1110/1110 [00:01<00:00, 836.04it/s]


pptp:  0  | pptn:  10  | pntp:  91  | pntn:  1009
K =  6  | Accuracy =  0.909009009009009  | Correctly predicted positives % =  0.0  | correct positive number =  0  out of  91  | pp =  10  | pn =  1100


100%|██████████| 1110/1110 [00:01<00:00, 858.27it/s]


pptp:  0  | pptn:  0  | pntp:  91  | pntn:  1019
K =  7  | Accuracy =  0.918018018018018  | Correctly predicted positives % =  0.0  | correct positive number =  0  out of  91  | pp =  0  | pn =  1110


100%|██████████| 1110/1110 [00:01<00:00, 791.63it/s]


pptp:  0  | pptn:  3  | pntp:  91  | pntn:  1016
K =  8  | Accuracy =  0.9153153153153153  | Correctly predicted positives % =  0.0  | correct positive number =  0  out of  91  | pp =  3  | pn =  1107


100%|██████████| 1110/1110 [00:01<00:00, 844.80it/s]

pptp:  0  | pptn:  0  | pntp:  91  | pntn:  1019
K =  9  | Accuracy =  0.918018018018018  | Correctly predicted positives % =  0.0  | correct positive number =  0  out of  91  | pp =  0  | pn =  1110





## Part-3
### Subpart-1

Combine both the vectors obtained from the tweet_vectoriser() and get_features()

In [20]:
# your code here

def combineVectors(dataVector, tfidfVector):
    combinedVector = np.empty((dataVector.shape[0], dataVector.shape[0] + tfidfVector.shape[0] - 1))
    for dataEntry, tfidfEntry in zip(dataVector, tfidfVector):
        combinedVector[dataEntry[0],:dataEntry.shape] = dataEntry
        combinedVector[tfidfEntry[0], dataEntry.shape:] = tfidfEntry[:-1]
    
    return combinedVector

combineVectors(n)

[1 2 3 4]


Perform KNN using the vector obtained in the previous step. Following are the steps to be followed:

1. Normalise the vectors
2. Split the data into training and test to estimate the performance.
3. Fit the Nearest Neughbiurs module to the training data and obtain the predicted class by getting the nearest neighbours on the test data.
4. Report the accuracy, chosen k-value and method used to obtain the predicted class. Hint: Plot accuracies for a range of k-values.

In [10]:
# your code here

### Subpart-2

Explain the differences between the accuracies obtained in each part above based on the features used.