<a href="https://colab.research.google.com/github/arshika77/Correlation-Matrix-Memory/blob/master/key_data_pairs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Imports

In [2]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import spacy
import re
import string
from collections import Counter
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

##Loading and visualising the data

In [3]:
!unzip "NLP Data.zip" -d "data_directory"

Archive:  NLP Data.zip
   creating: data_directory/NLP Data/data/
  inflating: data_directory/NLP Data/data/train.csv  
  inflating: data_directory/NLP Data/data/val.csv  
  inflating: data_directory/NLP Data/README.md  


In [4]:
!ls

 data_directory  'NLP Data.zip'   sample_data


In [3]:
train_data = pd.read_csv('data_directory/NLP Data/data/train.csv')
val_data = pd.read_csv('data_directory/NLP Data/data/val.csv')

In [4]:
train_data.head()

Unnamed: 0,Title,Post,Flair
0,netflix the family has been an amazing watch d...,netflixs new series the family is about a secr...,10
1,all results are out is iiitm gwalior it branch...,the internet seems to think so average package...,0
2,which are the things you always buy made in india,you can include the reason for your preference...,0
3,weekly coders hackers amp all tech related thr...,last week issue all threads every week on frid...,11
4,what are some good unknown companies to work a...,there are similar posts on other subreddits bu...,0


In [5]:
train_data.groupby('Flair').count()

Flair,Title,Post
0,16614,16578
1,811,804
2,76,71
3,791,783
4,365,365
5,172,171
6,293,290
7,8114,8041
8,96,93
9,1044,1038


In [6]:
print(train_data)

                                                   Title  ... Flair
0      netflix the family has been an amazing watch d...  ...    10
1      all results are out is iiitm gwalior it branch...  ...     0
2      which are the things you always buy made in india  ...     0
3      weekly coders hackers amp all tech related thr...  ...    11
4      what are some good unknown companies to work a...  ...     0
...                                                  ...  ...   ...
36542       how to start trading cryptocurrency in india  ...     0
36543                        may i eat my beef in peace   ...    10
36544                         legal question about metoo  ...     0
36545  are there any andhra university alumni or stud...  ...    14
36546  what are some of the ways i can generate passi...  ...     1

[36547 rows x 3 columns]


In [7]:
val_data.head()

Unnamed: 0,Title,Post,Flair
0,travelling outside india for the first time be...,as the title says i will be travelling outside...,0
1,i am an american traveling to pune for work th...,more info i will be on a work trip this summer...,10
2,roasting channels vs reaction channels,why do you like or dislike any of those,7
3,how the fake dadasaheb phalke awards game the ...,crossposting from rbollywood for more visibili...,10
4,today congress has been greatly weakened or be...,west bengal seats breakaway party all india t...,10


In [8]:
print(val_data)

                                                  Title  ... Flair
0     travelling outside india for the first time be...  ...     0
1     i am an american traveling to pune for work th...  ...    10
2                roasting channels vs reaction channels  ...     7
3     how the fake dadasaheb phalke awards game the ...  ...    10
4     today congress has been greatly weakened or be...  ...    10
...                                                 ...  ...   ...
8996  pserious can sc deliver a judgement similar to...  ...     0
8997  the princess of uae tweeted in response to isl...  ...    10
8998  paskindia why is it the norm to expect blatant...  ...     0
8999  np this is an unsual post i am looking to play...  ...     7
9000  explain to an idiot how aadhar unifying all ou...  ...     0

[9001 rows x 3 columns]


We assume that the flair is determined by the title and post taken together as the complete content. We therefore concatenate the Title and Post columns into a single 'Content' column for the model.

##Encoding

In [9]:
train_data['Title'] = train_data['Title'].fillna('')
train_data['Post'] = train_data['Post'].fillna('')

In [10]:
train_data['Content'] = train_data['Title'].str.cat(train_data['Post'], sep = " ")
train_data = train_data.drop(['Title','Post'],axis=1)

In [11]:
train_data.head()

Unnamed: 0,Flair,Content
0,10,netflix the family has been an amazing watch d...
1,0,all results are out is iiitm gwalior it branch...
2,0,which are the things you always buy made in in...
3,11,weekly coders hackers amp all tech related thr...
4,0,what are some good unknown companies to work a...


In [12]:
train_data['Content_length'] = train_data['Content'].apply(lambda x: len(x.split()))

train_data.head()

Unnamed: 0,Flair,Content,Content_length
0,10,netflix the family has been an amazing watch d...,70
1,0,all results are out is iiitm gwalior it branch...,38
2,0,which are the things you always buy made in in...,50
3,11,weekly coders hackers amp all tech related thr...,85
4,0,what are some good unknown companies to work a...,84


In [13]:
train_data['Content_length'].mean()

119.29173393164966

The mean content length is approximately 120 words, which matches the default sequence length `N=120` used by `encode_sentence` below.

###Tokenize the words

In [14]:
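tok = spacy.load('en_core_web_sm')  # spaCy 3 removed the 'en' shortcut; on spaCy 2, 'en' also works

def tokenize(text):
    # replace runs of non-ASCII characters with a space
    text = re.sub(r"[^\x00-\x7F]+", " ", text)
    # replace punctuation, digits, and \r, \t, \n with spaces
    regex = re.compile('[' + re.escape(string.punctuation) + '0-9\\r\\t\\n]')
    nopunct = regex.sub(" ", text.lower())
    return [token.text for token in tok.tokenizer(nopunct)]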
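
To see what the two regex substitutions in `tokenize` do before spaCy splits the text, here is a minimal sketch of the cleaning step on its own. The sample sentence is made up for illustration, and `str.split()` is only a stand-in for the spaCy tokenizer (an assumption, so the result may differ from `tok.tokenizer` in how whitespace is handled).

```python
import re
import string

# Made-up sample text (not from the dataset).
text = "Café prices rose 20% in 2020!\nWhy?"

# Step 1: replace runs of non-ASCII characters with a space.
ascii_only = re.sub(r"[^\x00-\x7F]+", " ", text)

# Step 2: lowercase, then replace punctuation, digits, \r, \t and \n with spaces.
regex = re.compile('[' + re.escape(string.punctuation) + '0-9\\r\\t\\n]')
cleaned = regex.sub(" ", ascii_only.lower())

# str.split() is a stand-in for the spaCy tokenizer used in the notebook.
tokens = cleaned.split()
print(tokens)  # ['caf', 'prices', 'rose', 'in', 'why']
```

Note that dropping non-ASCII characters truncates words such as "café" to "caf".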

In [15]:
counts = Counter()
for text in train_data['Content']:
    counts.update(tokenize(text))

In [16]:
print("num_words before:",len(counts.keys()))
for word in list(counts):
    if counts[word] < 2:
        del counts[word]
print("num_words after:",len(counts.keys()))

num_words before: 88692
num_words after: 42719
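
The pruning step can be illustrated on a toy corpus. This is a minimal sketch: the three sentences are invented, and `str.split()` stands in for the spaCy-based `tokenize` (both are assumptions for illustration).

```python
from collections import Counter

# Toy corpus; str.split() stands in for the spaCy-based tokenize().
toy_corpus = [
    "india wins the match",
    "the match was in india",
    "weather update",
]

toy_counts = Counter()
for sentence in toy_corpus:
    toy_counts.update(sentence.split())

# Drop words seen fewer than 2 times, as in the cell above.
# Iterate over a list copy because keys are deleted during the loop.
for word in list(toy_counts):
    if toy_counts[word] < 2:
        del toy_counts[word]

print(sorted(toy_counts))  # ['india', 'match', 'the']
```

Words that occur only once are removed from the vocabulary and are later encoded as `UNK`.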


In [17]:
vocab2index = {"":0, "UNK":1}  # index 0 is the padding value, index 1 stands for unknown words
words = ["", "UNK"]
for word in counts:
    vocab2index[word] = len(words)
    words.append(word)

In [18]:
def encode_sentence(text, vocab2index, N=120):
    tokenized = tokenize(text)
    encoded = np.zeros(N, dtype=int)
    enc1 = np.array([vocab2index.get(word, vocab2index["UNK"]) for word in tokenized])
    length = min(N, len(enc1))
    encoded[:length] = enc1[:length]
    return encoded, length
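
The padding and truncation behaviour of `encode_sentence` can be checked by hand on a small case. The sketch below mirrors the same logic with plain Python lists instead of NumPy; `encode_tokens`, the toy vocabulary, and the short length `N=5` are hypothetical names and values chosen for illustration.

```python
def encode_tokens(tokens, vocab2index, N=5):
    """Map tokens to ids, then pad with 0 or truncate to exactly N ids."""
    ids = [vocab2index.get(tok, vocab2index["UNK"]) for tok in tokens]
    length = min(N, len(ids))
    padded = ids[:length] + [0] * (N - length)
    return padded, length

# Hypothetical toy vocabulary: 0 is padding, 1 is the unknown-word id.
toy_vocab = {"": 0, "UNK": 1, "india": 2, "the": 3, "match": 4}

short_enc = encode_tokens("the match".split(), toy_vocab)
long_enc = encode_tokens("india won the match against the visitors".split(), toy_vocab)

print(short_enc)  # ([3, 4, 0, 0, 0], 2)
print(long_enc)   # ([2, 1, 3, 4, 1], 5)
```

The returned length is the number of real (non-padding) ids, capped at `N`.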

In [19]:
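# dtype=object: each entry holds (padded ids, true length), which is ragged and fails in recent NumPy without it
train_data['encoded_Content'] = train_data['Content'].apply(lambda x: np.array(encode_sentence(x, vocab2index), dtype=object))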

train_data.head()

Unnamed: 0,Flair,Content,Content_length,encoded_Content
0,10,netflix the family has been an amazing watch d...,70,"[[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 3, 13, 5..."
1,0,all results are out is iiitm gwalior it branch...,38,"[[32, 53, 54, 55, 25, 1, 56, 57, 58, 59, 3, 60..."
2,0,which are the things you always buy made in in...,50,"[[82, 54, 3, 83, 11, 84, 85, 86, 30, 87, 11, 6..."
3,11,weekly coders hackers amp all tech related thr...,85,"[[110, 111, 112, 113, 32, 114, 115, 116, 117, ..."
4,0,what are some good unknown companies to work a...,84,"[[14, 54, 48, 59, 158, 159, 18, 160, 15, 30, 8..."


In [20]:
Counter(train_data['Flair'])

Counter({0: 16616,
         1: 811,
         2: 76,
         3: 793,
         4: 365,
         5: 172,
         6: 293,
         7: 8118,
         8: 96,
         9: 1046,
         10: 4847,
         11: 580,
         12: 961,
         13: 256,
         14: 1517})

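The dataset is heavily imbalanced: flair 0 has far more data points than any other class (16,616 of 36,547 rows, about 45%), followed by flairs 7 (8,118) and 10 (4,847).

Flairs 2 and 8 have very few data points (76 and 96 respectively).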
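
The class shares can be computed directly from the counts. This is a small sketch that copies the numbers from the `Counter` output above; the variable names are illustrative.

```python
from collections import Counter

# Flair counts copied from the Counter output above.
flair_counts = Counter({
    0: 16616, 1: 811, 2: 76, 3: 793, 4: 365,
    5: 172, 6: 293, 7: 8118, 8: 96, 9: 1046,
    10: 4847, 11: 580, 12: 961, 13: 256, 14: 1517,
})

total = sum(flair_counts.values())
shares = {flair: count / total for flair, count in flair_counts.items()}

# Print the three largest classes with their share of the training set.
for flair, count in flair_counts.most_common(3):
    print(f"flair {flair}: {count} rows ({shares[flair]:.1%})")
```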

Repeat the encoding steps for the validation dataset
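
A minimal sketch of what repeating these steps could look like, under the assumption that the vocabulary built from the training data is reused rather than rebuilt, so that validation words not seen in training map to `UNK`. The helper `add_content_column`, the toy frame, the toy vocabulary, and the whitespace tokenization are all hypothetical stand-ins; in the notebook, `val_data`, `vocab2index`, and `encode_sentence` would take their place.

```python
import pandas as pd

def add_content_column(df):
    """Fill missing Title/Post with '' and join them into 'Content'."""
    df = df.copy()
    df["Title"] = df["Title"].fillna("")
    df["Post"] = df["Post"].fillna("")
    df["Content"] = df["Title"].str.cat(df["Post"], sep=" ")
    return df.drop(["Title", "Post"], axis=1)

# Hypothetical stand-ins: a tiny validation frame and a toy training vocabulary.
toy_val = pd.DataFrame({
    "Title": ["the match", None],
    "Post": [None, "india wins"],
    "Flair": [7, 0],
})
toy_vocab = {"": 0, "UNK": 1, "the": 2, "match": 3, "india": 4}

toy_val = add_content_column(toy_val)

# Look words up in the training vocabulary; unseen validation words become UNK.
toy_val["encoded_Content"] = toy_val["Content"].apply(
    lambda text: [toy_vocab.get(word, toy_vocab["UNK"]) for word in text.split()]
)
print(toy_val["encoded_Content"].tolist())  # [[2, 3], [4, 1]]
```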