<center><h2>Artificial and Computational Intelligence (Assignment - 2)</h2></center>

## Problem Statement

As part of the 2nd Assignment, we'll implement Bayesian Networks and also learn to use the pomegranate library.

You are required to create a Bayesian network model that predicts the probability of a match result. The detailed problem description and the marking scheme are attached to this assignment as a PDF.

### What is a Bayesian Network?

A Bayesian network, Bayes network, belief network, decision network, Bayes(ian) model or probabilistic directed acyclic graphical model is a probabilistic graphical model (a type of statistical model) that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). 

Bayesian networks are ideal for taking an event that occurred and predicting the likelihood that any one of several possible known causes was the contributing factor. For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases. 
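
To make the disease/symptom example concrete, here is a minimal sketch of Bayes' rule for one disease and one symptom. All numbers are hypothetical and chosen only for illustration; they are not taken from the assignment.

```python
# Hypothetical numbers, for illustration only (not from the assignment data)
p_disease = 0.01                 # P(disease)
p_symptom_given_disease = 0.90   # P(symptom | disease)
p_symptom_given_healthy = 0.05   # P(symptom | no disease)

# Total probability of observing the symptom
p_symptom = (p_symptom_given_disease * p_disease
             + p_symptom_given_healthy * (1 - p_disease))

# Bayes' rule: P(disease | symptom)
p_disease_given_symptom = p_symptom_given_disease * p_disease / p_symptom
print(round(p_disease_given_symptom, 4))
```

A Bayesian network generalises this calculation to many variables by storing one such conditional table per node.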

### Dataset

The dataset can be downloaded from https://drive.google.com/drive/folders/1oMtKmmvPkN4O8DmrHMJe6M8CbB93Z5kw . You can access it only with your BITS ID. The same dataset is also attached to the assignment.

#### Dataset Description
##### Sample Tuple

Y	won	5wickets	lost	2nd	vWest_Indies	Home	6-Nov-11

##### Explanation
- The first column indicates whether Ashwin was in the playing 11 (Y) or not (N).
- The second column is the result of the match; "won" indicates India won the match.
- The third column is the margin of victory / loss.
- The fourth column is the result of the toss; "won" indicates India won the toss.
- The fifth column is the batting order: whether India batted 1st or 2nd.
- The sixth column is the opponent.
- The seventh column is the location of the match: Home (India) or Away.
- The last column is the start date of the match.


### Evaluation
We wish to evaluate based on 
- coding practices being followed
- commenting to explain the code and logic behind doing something
- your understanding and explanation of data
- how good the model would perform

### BITS Roll Numbers, Names
ACI_GROUP013

LOVEL SETIA 2018ap04502

PRAVIN PAWAR 2018ap04559

SOLANKI VINAYKUMAR NANUBHAI 2018AP04509

KUSHALI ALIAS ALKESH ESSO PRABHU DESSAI 2018ap04534

### Assumptions
All probabilities are calculated from the provided data; no domain knowledge is used to presume any probability.

For the Ashwin node, the CPT has no entry for Location="Home" and Ashwin="N", since Ashwin played in every home match. Hence we assume a probability of 0 for this scenario.

For the Result node, the CPT has no entry for Ashwin="N", Bat="2nd" and Result="won", since India did not win any match batting second in which Ashwin did not play. Hence we assume a probability of 0 for this scenario.

### Import libraries

In [1]:
from pomegranate import DiscreteDistribution, ConditionalProbabilityTable, State, BayesianNetwork
import pandas as pd

### Read data

In [2]:
df = pd.read_excel("India_Test_Stats.xlsx")
df.head()

Unnamed: 0,Ashwin,Result,Margin,Toss,Bat,Opposition,Location,Start Date
0,Y,won,5 wickets,lost,2nd,v West Indies,Home,2011-11-06
1,Y,won,inns & 15 runs,won,1st,v West Indies,Home,2011-11-14
2,Y,draw,-,lost,2nd,v West Indies,Home,2011-11-22
3,Y,lost,122 runs,lost,2nd,v Australia,Away,2011-12-26
4,Y,lost,inns & 68 runs,won,1st,v Australia,Away,2012-01-03


In [3]:
df.shape

(85, 8)

### Pre-process data


Observation: No duplicate data

In [4]:
df.duplicated().sum()

0

- Start Date and Margin are removed since they don't play any role in determining the result.
- Opposition is removed since we are asked to consider only the following 5 variables as nodes of the Bayesian network:
     
     a. Test Location
     
     b. Ashwin Playing
     
     c. Toss
     
     d. Result
     
     e. Batting

In [5]:
df = df.drop(columns=['Start Date', 'Margin', 'Opposition'])

Observation: No null values

In [6]:
df.isnull().any()

Ashwin      False
Result      False
Toss        False
Bat         False
Location    False
dtype: bool

### Data Description

In [7]:
df.head()

Unnamed: 0,Ashwin,Result,Toss,Bat,Location
0,Y,won,lost,2nd,Home
1,Y,won,won,1st,Home
2,Y,draw,lost,2nd,Home
3,Y,lost,lost,2nd,Away
4,Y,lost,won,1st,Away


In [8]:
df.shape

(85, 5)

In [9]:
df.describe(include='all')

Unnamed: 0,Ashwin,Result,Toss,Bat,Location
count,85,85,85,85,85
unique,2,3,2,2,2
top,Y,won,lost,1st,Home
freq,70,47,45,46,43


### Construction of Bayesian Network

Solution for part 1 

In [10]:
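def calc_prior_probility(arr):
    """Return {value: relative frequency} for a pandas Series."""
    prob = {}
    uniqueValues = arr.unique()
    rowCount = arr.count()        # number of non-null rows
    summary = arr.value_counts()  # occurrences of each value
    for uniqueItem in uniqueValues:
        prob[uniqueItem] = float(summary[uniqueItem] / rowCount)
    return prob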
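
The same relative frequencies can also be obtained directly from pandas. The sketch below uses `value_counts(normalize=True)` on a toy Series whose values are made up for illustration (it is not the assignment data), and is offered as an alternative rather than as the notebook's method.

```python
import pandas as pd

# Toy column with hypothetical values (not the assignment data)
toss = pd.Series(['won', 'lost', 'lost', 'won', 'lost'])

# normalize=True returns relative frequencies instead of raw counts
prior = toss.value_counts(normalize=True).to_dict()
print(prior)
```

The resulting dictionary has the same shape as the one returned by the function above, so it could be passed to `DiscreteDistribution` in the same way.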

Solution for part 2 

In [11]:
# calculate conditional probability of arr2 given arr1
def calc_conditional_probility_2arrays(df, arr1, arr2):
    prob = df.groupby(arr1)[arr2].value_counts() / df.groupby(arr1)[arr2].count()
    # flatten to rows of [arr1 value, arr2 value, probability]
    rows = [[a, b, c] for (a, b), c in prob.to_dict().items()]
    return rows

# calculate conditional probability of arr3 given arr1 and arr2
def calc_conditional_probility_3arrays(df, arr1, arr2, arr3):
    prob = df.groupby([arr1, arr2])[arr3].value_counts() / df.groupby([arr1, arr2])[arr3].count()
    # flatten to rows of [arr1 value, arr2 value, arr3 value, probability]
    rows = [[a, b, c, d] for (a, b, c), d in prob.to_dict().items()]
    return rows
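
One property of the group-by approach is that combinations which never occur in the data produce no row at all. As a hedged alternative, the sketch below builds a conditional table with `pd.crosstab(..., normalize='index')`, which keeps unseen combinations as explicit zeros. The toy frame is hypothetical and only mimics the column names of the dataset.

```python
import pandas as pd

# Toy data with hypothetical values; only the column names mirror the dataset
toy = pd.DataFrame({
    'Location': ['Home', 'Home', 'Away', 'Away', 'Away'],
    'Ashwin':   ['Y',    'Y',    'Y',    'N',    'Y'],
})

# Each row of the crosstab is P(Ashwin | Location) and sums to 1
cpt = pd.crosstab(toy['Location'], toy['Ashwin'], normalize='index')

# Flatten to [parent value, child value, probability] rows
rows = [[parent, child, float(p)]
        for parent, row in cpt.iterrows()
        for child, p in row.items()]
print(rows)
```

Here the (Home, N) combination never occurs in the toy data, yet it still appears in `rows` with probability 0.0.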

Solution for part 3 

Prior probability of the toss. We take the prior from the data (about 53% lost, 47% won), which is close to the 50% chance of winning or losing a toss in reality.

In [12]:
tossPrior = calc_prior_probility(df['Toss'])
tossPrior

{'lost': 0.5294117647058824, 'won': 0.47058823529411764}

Prior probability of location based on data.

In [13]:
locationPrior = calc_prior_probility(df['Location'])
locationPrior

{'Home': 0.5058823529411764, 'Away': 0.49411764705882355}

Conditional probability of Ashwin playing given the location. We observed that Ashwin played in 100% of home matches but only about 64% of away matches. Hence we concluded that whether Ashwin plays depends on the location.

In [14]:
ashwinCond = calc_conditional_probility_2arrays(df, 'Location', 'Ashwin')
ashwinCond

[['Away', 'Y', 0.6428571428571429],
 ['Away', 'N', 0.35714285714285715],
 ['Home', 'Y', 1.0]]

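Conditional probability of the batting order given the toss. Based on the data, the batting order depends on the toss result: India batted first in 90% of the matches where it won the toss and batted second in about 78% of the matches where it lost the toss.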

In [15]:
batCond = calc_conditional_probility_2arrays(df, 'Toss','Bat')
batCond

[['lost', '2nd', 0.7777777777777778],
 ['lost', '1st', 0.2222222222222222],
 ['won', '1st', 0.9],
 ['won', '2nd', 0.1]]

Conditional probability of result based on Ashwin playing and batting order.

In [16]:
resultCond = calc_conditional_probility_3arrays(df, 'Ashwin','Bat', 'Result')
resultCond

[['N', '1st', 'won', 0.5555555555555556],
 ['N', '1st', 'draw', 0.2222222222222222],
 ['N', '1st', 'lost', 0.2222222222222222],
 ['N', '2nd', 'lost', 0.8333333333333334],
 ['N', '2nd', 'draw', 0.16666666666666666],
 ['Y', '1st', 'won', 0.7027027027027027],
 ['Y', '1st', 'lost', 0.1891891891891892],
 ['Y', '1st', 'draw', 0.10810810810810811],
 ['Y', '2nd', 'won', 0.48484848484848486],
 ['Y', '2nd', 'draw', 0.2727272727272727],
 ['Y', '2nd', 'lost', 0.24242424242424243]]

The Ashwin CPT computed above has no row for Location="Home" and Ashwin="N", since Ashwin played in every home match. We add this row with probability 0.

In [17]:
ashwinCond.append(['Home', 'N', 0.0])

The Result CPT computed above has no row for Ashwin="N", Bat="2nd" and Result="won", since India did not win any match batting second in which Ashwin did not play. We add this row with probability 0.

In [18]:
resultCond.append(['N','2nd','won',0.0])
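
The two rows above were added by hand after inspecting the tables. As a sketch of how the same step could be automated, the hypothetical helper below (`fill_missing_rows` is not part of pomegranate) adds a 0.0-probability row for every combination of values that does not appear in a table.

```python
from itertools import product

def fill_missing_rows(rows, domains):
    """Return rows plus a 0.0-probability row for every unseen combination.

    rows    : list like [[parent value(s)..., child value, probability], ...]
    domains : one list of possible values per non-probability column
    (hypothetical helper, not part of pomegranate)
    """
    seen = {tuple(r[:-1]) for r in rows}
    missing = [list(combo) + [0.0]
               for combo in product(*domains)
               if combo not in seen]
    return rows + missing

# Values copied from the Ashwin CPT output shown earlier
ashwin_rows = [['Away', 'Y', 0.6428571428571429],
               ['Away', 'N', 0.35714285714285715],
               ['Home', 'Y', 1.0]]
filled = fill_missing_rows(ashwin_rows, [['Home', 'Away'], ['Y', 'N']])
print(filled)
```

The helper returns a new list and leaves its input untouched, so it can be applied to any of the conditional tables before they are passed to `ConditionalProbabilityTable`.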

In [19]:
# Root nodes: priors estimated from the data
location = DiscreteDistribution(locationPrior)
toss = DiscreteDistribution(tossPrior)
# Child nodes: each CPT lists its parent distributions in column order
bat = ConditionalProbabilityTable(batCond, [toss])
ashwin = ConditionalProbabilityTable(ashwinCond, [location])
result = ConditionalProbabilityTable(resultCond, [ashwin, bat])

In [20]:
s1 = State( location, name="location" )
s2 = State( ashwin, name="ashwin" )
s3 = State( toss, name="toss" )
s4 = State( bat, name="bat" )
s5 = State(result, name="result" )

In [21]:
network = BayesianNetwork( "Ashwin selection" )
network.add_states(s1, s2, s3,s4,s5)
network.add_edge(s1, s2)
network.add_edge(s2, s5)
network.add_edge(s3, s4)
network.add_edge(s4, s5)
network.bake()
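
A Bayesian network must be a directed acyclic graph. As a small sanity check that is independent of pomegranate, the sketch below encodes the same four edges as a parent map and asks the standard-library `graphlib` module (Python 3.9+) for a topological order; a cycle would raise `CycleError`. The dictionary is an illustration of the structure, not an object used by the notebook.

```python
from graphlib import TopologicalSorter, CycleError  # Python 3.9+

# Each node maps to the set of its parents, mirroring the four edges of the network
parents = {
    'location': set(),
    'toss': set(),
    'ashwin': {'location'},
    'bat': {'toss'},
    'result': {'ashwin', 'bat'},
}

try:
    # static_order() yields every node after all of its parents
    order = list(TopologicalSorter(parents).static_order())
    print('Valid DAG, one topological order:', order)
except CycleError as err:
    print('Cycle found:', err)
```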

### Solution for part 4

 a) Probability of India winning, batting 2nd, Ashwin playing

In [22]:
p=network.probability([None,'Y',None,'2nd','won'])
print('Probability of India winning, batting 2nd, Ashwin playing: {0:.2f}'.format(p))

Probability of India winning, batting 2nd, Ashwin playing: 0.48


b) Probability of India winning, batting 2nd, Ashwin not playing

In [23]:
p=network.probability([None,'N',None,'2nd','won'])
print('Probability of India winning, batting 2nd, Ashwin not playing: {0:.2f}'.format(p))

Probability of India winning, batting 2nd, Ashwin not playing: 0.00


c) Probability of India losing, batting 2nd, Ashwin playing

In [24]:
p=network.probability([None,'Y',None,'2nd','lost'])
print('Probability of India losing, batting 2nd, Ashwin playing: {0:.2f}'.format(p))

Probability of India losing, batting 2nd, Ashwin playing: 0.24


d) Probability of India losing, batting 2nd, Ashwin not playing

In [25]:
p=network.probability([None,'N',None,'2nd','lost'])
print('Probability of India losing, batting 2nd, Ashwin not playing: {0:.2f}'.format(p))

Probability of India losing, batting 2nd, Ashwin not playing: 0.83
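
The four printed values (0.48, 0.00, 0.24, 0.83) coincide with the corresponding entries of the Result CPT, so they read as conditional probabilities P(Result | Ashwin, Bat) rather than joint probabilities. As a cross-check that does not rely on pomegranate, the sketch below recomputes both quantities by hand from the tables printed earlier. The `marginal` helper is hypothetical, and the joint value uses the fact that Ashwin and Bat share no ancestor in this network and are therefore independent in the model.

```python
# Probabilities copied from the notebook outputs
p_location = {'Home': 0.5058823529411764, 'Away': 0.49411764705882355}
p_toss = {'lost': 0.5294117647058824, 'won': 0.47058823529411764}
p_ashwin = {('Away', 'Y'): 0.6428571428571429, ('Away', 'N'): 0.35714285714285715,
            ('Home', 'Y'): 1.0, ('Home', 'N'): 0.0}        # keyed by (location, ashwin)
p_bat = {('lost', '2nd'): 0.7777777777777778, ('lost', '1st'): 0.2222222222222222,
         ('won', '1st'): 0.9, ('won', '2nd'): 0.1}           # keyed by (toss, bat)
# Only the Result entries needed for the four queries, keyed by (ashwin, bat, result)
p_result = {('Y', '2nd', 'won'): 0.48484848484848486,
            ('N', '2nd', 'won'): 0.0,
            ('Y', '2nd', 'lost'): 0.24242424242424243,
            ('N', '2nd', 'lost'): 0.8333333333333334}

def marginal(prior, cpt, value):
    """P(child = value) = sum over parent values of P(parent) * P(child | parent)."""
    return sum(prior[parent] * cpt[(parent, value)] for parent in prior)

for (a, b, r), conditional in p_result.items():
    # Joint under the network's factorisation: P(A) * P(B) * P(R | A, B)
    joint = marginal(p_location, p_ashwin, a) * marginal(p_toss, p_bat, b) * conditional
    print(f'P({r} | Ashwin={a}, Bat={b}) = {conditional:.2f}   '
          f'P(Ashwin={a}, Bat={b}, Result={r}) = {joint:.2f}')
```

The joint probabilities are necessarily smaller than the conditional ones, because they also account for how often each Ashwin/Bat combination occurs.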


### Conclusion

There is a 48% chance of India winning when India bats second and Ashwin plays.

There is a 0% chance of India winning when India bats second and Ashwin does not play.

There is a 24% chance of India losing when India bats second and Ashwin plays.

There is an 83% chance of India losing when India bats second and Ashwin does not play.

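These numbers favour Ashwin's case and should aid the selectors in making the right decision after considering all other factors. The figures for matches without Ashwin rest on a small sample, since he missed only 15 of the 85 Tests in the data.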

<h3><center> End!</center></h3>