# Lab-3 Probabilistic Classification

1) Sentence classification: 

Consider the files traindata.csv and testdata.csv. In these files, each row contains a sentence which belongs to one of 4 categories (science, sports, business, covid crisis). Learn a Naive Bayes classifier to predict the category of each sentence, based on the words in it (neglecting stop words). Use the training set to estimate the prior distribution over the class labels and class-conditional probabilities, i.e. the probability of each word occurring in a sentence having a particular class label. For each test sentence, your output should be the posterior distribution over the labels.

[Trick: never set p(w|Y=k)=0 for any word w and label k, even if word w never exists in any sentence with label k. Assign a small probability like 0.01. Adjust the probabilities of other words too, such that you get a proper conditional distribution]

i) Construct the vocabulary without stop-words [2 marks]

ii) Calculate the prior distribution of the labels [1 mark]

iii) Calculate the class-conditional probabilities of each word in the vocabulary, for each topic [4 marks]

iv) For each test sentence, create the posterior distribution over the labels [3 marks]

Part 1 - Vocabulary Creation

In [1]:
import pandas as pd
import numpy as np
import string

In [2]:
train=pd.read_csv('traindata.csv')
train.drop([ "Unnamed: 2", "Unnamed: 3","Unnamed: 4","Unnamed: 5"], axis =1 , inplace = True)

In [3]:
train.head()

Unnamed: 0,category,text
0,science,Outer space is not friendly to life. Extreme t...
1,sports,"Tennis, original name lawn tennis, game in whi..."
2,business,One woman who frequently flew on Southwest was...
3,covid,"In December 2019, almost seven years after the..."
4,science,Any life-forms that somehow find themselves in...


In [4]:
train['text'][0]

'Outer space is not friendly to life. Extreme temperatures, low pressure and radiation can quickly degrade cell membranes and destroy DNA.'

In [5]:
#importing the stop words
from nltk.corpus import stopwords

stop = stopwords.words('english')

In [6]:
#splitting the sentences into words to remove stop words
train["text"]= train["text"].str.lower().str.split()
train.head()

Unnamed: 0,category,text
0,science,"[outer, space, is, not, friendly, to, life., e..."
1,sports,"[tennis,, original, name, lawn, tennis,, game,..."
2,business,"[one, woman, who, frequently, flew, on, southw..."
3,covid,"[in, december, 2019,, almost, seven, years, af..."
4,science,"[any, life-forms, that, somehow, find, themsel..."


In [7]:
#removing the stopwords from the words present in the data
train['text']=train['text'].apply(lambda x: [item for item in x if item not in stop])
train.head()

Unnamed: 0,category,text
0,science,"[outer, space, friendly, life., extreme, tempe..."
1,sports,"[tennis,, original, name, lawn, tennis,, game,..."
2,business,"[one, woman, frequently, flew, southwest, cons..."
3,covid,"[december, 2019,, almost, seven, years, mers, ..."
4,science,"[life-forms, somehow, find, void, soon, die., ..."


In [8]:
#joining he words to count the frequencies of the words
train["text"]=train["text"].str.join(" ")
train.head()

Unnamed: 0,category,text
0,science,outer space friendly life. extreme temperature...
1,sports,"tennis, original name lawn tennis, game two op..."
2,business,one woman frequently flew southwest constantly...
3,covid,"december 2019, almost seven years mers 2012 ou..."
4,science,life-forms somehow find void soon die. unless ...


In [9]:
#constructing vocabulary from the text given after removing the stop words
df = pd.DataFrame(train.text.str.split(expand=True).stack().value_counts())
df.reset_index(level=0, inplace=True)
df.columns = ['category', 'freq']
df=df[df['freq'] >= 0]
df

Unnamed: 0,category,freq
0,tennis,12
1,customer,11
2,service,10
3,customers,9
4,employees,9
5,viruses,8
6,space,7
7,every,7
8,ball,7
9,players,6


Part 2 Prior Probabilities

In [10]:
#calcuating the frequncy of prior labels
df_y = pd.DataFrame(train.category.str.split(expand=True).stack().value_counts())
df_y.reset_index(level=0, inplace=True)
df_y.columns = ['category', 'freq']
df_y=df_y[df_y['freq'] >= 0]
df_y

Unnamed: 0,category,freq
0,covid,21
1,sports,20
2,science,20
3,business,19


In [11]:
#finding probability of the prior labels
prob=[]
for i in range(4):
    prob.append(df_y["freq"][i]/80)
df_y["prob"]=prob
df_y

Unnamed: 0,category,freq,prob
0,covid,21,0.2625
1,sports,20,0.25
2,science,20,0.25
3,business,19,0.2375


Part - 3 Finding Class conditional Probabilities 

In [12]:
#diving the given data category wise 
a = train.sort_values(by ='category', ascending = 1) 
business = a[:19]
covid=a[19:40]
science=a[40:60]
sports=a[60:]

In [13]:
#spliting the sentences to remove stop words for respective categories
business["text"]= business["text"].str.lower().str.split()
covid["text"]= covid["text"].str.lower().str.split()
science["text"]= science["text"].str.lower().str.split()
sports["text"]= sports["text"].str.lower().str.split()

#removing the stop words from the words
business['text']=business['text'].apply(lambda x: [item for item in x if item not in stop])
covid['text']=covid['text'].apply(lambda x: [item for item in x if item not in stop])
science['text']=science['text'].apply(lambda x: [item for item in x if item not in stop])
sports['text']=sports['text'].apply(lambda x: [item for item in x if item not in stop])

#joining the words again to calculate the frequncies of the words
business["text"]=business["text"].str.join(" ")
covid["text"]=covid["text"].str.join(" ")
science["text"]=science["text"].str.join(" ")
sports["text"]=sports["text"].str.join(" ")

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the cavea

In [14]:
#constructing vocabulary from the text given in business category
df_b = pd.DataFrame(business.text.str.split(expand=True).stack().value_counts())
df_b.reset_index(level=0, inplace=True)
df_b.columns = ['category', 'freq']
df_b=df_b[df_b['freq'] >= 0]

#constructing vocabulary from the text given in covid category
df_c = pd.DataFrame(covid.text.str.split(expand=True).stack().value_counts())
df_c.reset_index(level=0, inplace=True)
df_c.columns = ['category', 'freq']
df_c=df_c[df_c['freq'] >= 0]

#constructing vocabulary from the text given in science category
df_sc = pd.DataFrame(science.text.str.split(expand=True).stack().value_counts())
df_sc.reset_index(level=0, inplace=True)
df_sc.columns = ['category', 'freq']
df_sc=df_sc[df_sc['freq'] >= 0]

#constructing vocabulary from the text given in sports category 
df_sp = pd.DataFrame(sports.text.str.split(expand=True).stack().value_counts())
df_sp.reset_index(level=0, inplace=True)
df_sp.columns = ['category', 'freq']
df_sp=df_sp[df_sp['freq'] >= 0]

In [15]:
z,a = df_sp.shape
y,a = df_sc.shape
x,a = df_c.shape
w,a = df_b.shape
n,a = df.shape
print (n,w,x,y,z)

906 258 202 163 334


In [16]:
#finding probabilitites for business category
p_b=np.zeros(n) 
for i in range(w):
    for j in range(n):
        if df['category'][j] == df_b['category'][i] :
            p_b[j]=df_b['freq'][i]/df['freq'][j]
            if p_b[j] == 1:
                p_b[j] = 0.999
for i in range(n):
    if p_b[i] == 0:
        p_b[i] =0.001

#finding probabilitites for covid category 
p_c=np.zeros(n)
for i in range(x):
    for j in range(n):
        if df['category'][j] == df_c['category'][i] :
            p_c[j]=df_c['freq'][i]/df['freq'][j]
            if p_c[j] == 1:
                p_c[j] = 0.999
for i in range(n):
    if p_c[i] == 0:
        p_c[i] =0.001

#finding probabilitites for science category
p_sc=np.zeros(n)
for i in range(y):
    for j in range(n):
        if df['category'][j] == df_sc['category'][i] :
            p_sc[j]=df_sc['freq'][i]/df['freq'][j]
            if p_sc[j] == 1:
                p_sc[j] = 0.999
for i in range(n):
    if p_sc[i] == 0:
        p_sc[i] =0.001

#finding probabilitites for sports category
p_sp=np.zeros(n)
for i in range(z):
    for j in range(n):
        if df['category'][j] == df_sp['category'][i] :
            p_sp[j]=df_sp['freq'][i]/df['freq'][j]
            if p_sp[j] == 1:
                p_sp[j] = 0.999
for i in range(n):
    if p_sp[i] == 0:
        p_sp[i] =0.001

In [17]:
#creating the class- conditional distribution for each category 
df = pd.concat([df, pd.Series(p_b, index=df.index, name='P(w/y=b)')], axis=1)
df = pd.concat([df, pd.Series(p_c, index=df.index, name='P(w/y=c)')], axis=1)
df = pd.concat([df, pd.Series(p_sc, index=df.index, name='P(w/y=sc)')], axis=1)
df = pd.concat([df, pd.Series(p_sp, index=df.index, name='P(w/y=sp)')], axis=1)
df

Unnamed: 0,category,freq,P(w/y=b),P(w/y=c),P(w/y=sc),P(w/y=sp)
0,tennis,12,0.001000,0.001,0.001000,0.999000
1,customer,11,0.999000,0.001,0.001000,0.001000
2,service,10,0.300000,0.001,0.001000,0.700000
3,customers,9,0.999000,0.001,0.001000,0.001000
4,employees,9,0.999000,0.001,0.001000,0.001000
5,viruses,8,0.001000,0.999,0.001000,0.001000
6,space,7,0.001000,0.001,0.999000,0.001000
7,every,7,0.714286,0.001,0.142857,0.142857
8,ball,7,0.001000,0.001,0.001000,0.999000
9,players,6,0.001000,0.001,0.001000,0.999000


Part - 4 Predicting the Test cases

In [18]:
test=pd.read_csv('testdata.csv')

In [19]:
test.head()

Unnamed: 0,category,text
0,science,"He estimates that 1,000-micrometer pellets cou..."
1,science,“That’s enough time to potentially get to Mars...
2,science,How exactly clumps of microbes might get expel...
3,science,The microbes might get kicked up by small mete...
4,science,"Someday, if microbial life is ever discovered ..."


In [20]:
#splitting the sentences into words to remove stop words
test["text"]= test["text"].str.lower().str.split()
test.head()

Unnamed: 0,category,text
0,science,"[he, estimates, that, 1,000-micrometer, pellet..."
1,science,"[“that’s, enough, time, to, potentially, get, ..."
2,science,"[how, exactly, clumps, of, microbes, might, ge..."
3,science,"[the, microbes, might, get, kicked, up, by, sm..."
4,science,"[someday,, if, microbial, life, is, ever, disc..."


In [21]:
#removing the stopwords from the words present in the data
test['text']=test['text'].apply(lambda x: [item for item in x if item not in stop])
test.head()

Unnamed: 0,category,text
0,science,"[estimates, 1,000-micrometer, pellets, could, ..."
1,science,"[“that’s, enough, time, potentially, get, mars..."
2,science,"[exactly, clumps, microbes, might, get, expell..."
3,science,"[microbes, might, get, kicked, small, meteorit..."
4,science,"[someday,, microbial, life, ever, discovered, ..."


In [22]:
a,b = test.shape

In [23]:
#Finding the probabilities for business label
p1=np.zeros(a)
for i in range(a):
    p1[i]=1
    for j in range(len(test['text'][i])):
        for k in range(n):
            if test['text'][i][j] == df["category"][k]:
                p1[i]=p1[i]*df["P(w/y=b)"][k]
    p1[i]=p1[i]*df_y['prob'][0]

#Finding the probabilities for covid label
p2=np.zeros(a)
for i in range(a):
    p2[i]=1
    for j in range(len(test['text'][i])):
        for k in range(n):
            if test['text'][i][j] == df["category"][k]:
                p2[i]=p2[i]*df["P(w/y=c)"][k]
    p2[i]=p2[i]*df_y['prob'][1]

#Finding the probabilities for Science label
p3=np.zeros(a)
for i in range(a):
    p3[i]=1
    for j in range(len(test['text'][i])):
        for k in range(n):
            if test['text'][i][j] == df["category"][k]:
                p3[i]=p3[i]*df["P(w/y=sc)"][k]
    p3[i]=p3[i]*df_y['prob'][2]

#Finding the probabilities for sports label
p4=np.zeros(a)
for i in range(a):
    p4[i]=1
    for j in range(len(test['text'][i])):
        for k in range(n):
            if test['text'][i][j] == df["category"][k]:
                p4[i]=p4[i]*df["P(w/y=sp)"][k]
    p4[i]=p4[i]*df_y['prob'][3]

b=np.zeros(a)
c=np.zeros(a)
sc=np.zeros(a)
sp=np.zeros(a)
for i in range(a):
    b[i]=p1[i]/(p1[i]+p2[i]+p3[i]+p4[i])
    c[i]=p2[i]/(p1[i]+p2[i]+p3[i]+p4[i])
    sc[i]=p3[i]/(p1[i]+p2[i]+p3[i]+p4[i])
    sp[i]=p4[i]/(p1[i]+p2[i]+p3[i]+p4[i])
    
test = pd.concat([test, pd.Series(b, index=test.index, name='Business')], axis=1)
test = pd.concat([test, pd.Series(c, index=test.index, name='Covid')], axis=1)
test = pd.concat([test, pd.Series(sc, index=test.index, name='Science')], axis=1)
test = pd.concat([test, pd.Series(sp, index=test.index, name='Sports')], axis=1)
pd.options.display.float_format = '{:.10f}'.format
test

Unnamed: 0,category,text,Business,Covid,Science,Sports
0,science,"[estimates, 1,000-micrometer, pellets, could, ...",0.0,0.0,1.0,0.0
1,science,"[“that’s, enough, time, potentially, get, mars...",7.021e-07,3e-10,0.9999992966,1e-09
2,science,"[exactly, clumps, microbes, might, get, expell...",1.1e-09,0.0,0.9999999989,0.0
3,science,"[microbes, might, get, kicked, small, meteorit...",0.0,0.0,1.0,0.0
4,science,"[someday,, microbial, life, ever, discovered, ...",1.0521e-06,1.002e-06,0.999996994,9.519e-07
5,sports,"[tennis,, service, correctly, returned,, playe...",0.0,0.0,0.0,1.0
6,sports,"[may, occur, tennis, player, fails, hit, ball,...",0.0,0.0,0.0,1.0
7,sports,"[win, game,, tennis, player, must, win, four, ...",0.0,0.0,0.0,1.0
8,sports,"[tennis,, never, satisfactorily, explained, th...",0.0,0.0,1.1e-09,0.9999999989
9,sports,"[tennis,, server’s, score, called, first;, thu...",0.0,0.0,0.0,1.0


In [24]:
#prdicting the Y_labels
predict=[]
for i in range(a):
    if(max(test['Business'][i],test['Science'][i],test['Sports'][i],test['Covid'][i])==test['Business'][i]):
        predict.append('business')
    elif(max(test['Business'][i],test['Science'][i],test['Sports'][i],test['Covid'][i])==test['Covid'][i]):
        predict.append('covid')
    elif(max(test['Business'][i],test['Science'][i],test['Sports'][i],test['Covid'][i])==test['Science'][i]):
        predict.append('science')
    elif(max(test['Business'][i],test['Science'][i],test['Sports'][i],test['Covid'][i])==test['Sports'][i]):
        predict.append('sports')
test["Predicted"]=predict
test

Unnamed: 0,category,text,Business,Covid,Science,Sports,Predicted
0,science,"[estimates, 1,000-micrometer, pellets, could, ...",0.0,0.0,1.0,0.0,science
1,science,"[“that’s, enough, time, potentially, get, mars...",7.021e-07,3e-10,0.9999992966,1e-09,science
2,science,"[exactly, clumps, microbes, might, get, expell...",1.1e-09,0.0,0.9999999989,0.0,science
3,science,"[microbes, might, get, kicked, small, meteorit...",0.0,0.0,1.0,0.0,science
4,science,"[someday,, microbial, life, ever, discovered, ...",1.0521e-06,1.002e-06,0.999996994,9.519e-07,science
5,sports,"[tennis,, service, correctly, returned,, playe...",0.0,0.0,0.0,1.0,sports
6,sports,"[may, occur, tennis, player, fails, hit, ball,...",0.0,0.0,0.0,1.0,sports
7,sports,"[win, game,, tennis, player, must, win, four, ...",0.0,0.0,0.0,1.0,sports
8,sports,"[tennis,, never, satisfactorily, explained, th...",0.0,0.0,1.1e-09,0.9999999989,sports
9,sports,"[tennis,, server’s, score, called, first;, thu...",0.0,0.0,0.0,1.0,sports
