<a href="https://colab.research.google.com/github/amazighy/DataAnalysis/blob/master/dataMiningHW5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd 
import numpy as np 

# **For easy access, we placed the data in a GitHub repository**
### **Here we import the data as a pandas DataFrame**

In [None]:

df = pd.read_csv("https://raw.githubusercontent.com/amazighy/DataAnalysis/master/d-clean.csv")


### **Here we read the first few rows of the data**

In [None]:
df.head()

Unnamed: 0,cluster,Gender,Age,Ethnicity,Income,Work hours,Health condition,Education,Motivation,Attitude,Intention,Ownership,1,ER,2,ER.1,3,ER.2,4,ER.3,5,ER.4,6,ER.5,7,ER.6,8,ER.7,9,ER.8,10,ER.9,11,ER.10,12,ER.11,13,ER.12,14,ER.13,15,ER.14
0,0,Female,44,Caucasian,"150,000-199,999",50+ hrs/week,Good,4-year college degree,0.265623,0.52659,0.507169,0.559363,Phy,0.166667,diet,0.714286,diet,0.428571,diet,0.571429,diet,0.75,diet,0.428571,diet,0.857143,diet,1,diet,1,diet,0.285714,diet,0.571429,diet,0.5,Phy,0,Phy,0.0,Phy,0.0
1,1,Female,37,Asian,"100,000-149,999",16-35 hrs/week,Excellent,2-year college degree,0.386128,0.896973,0.736022,0.924759,diet,0.0,diet,0.428571,diet,1.0,diet,0.857143,diet,0.875,diet,0.857143,diet,1.0,diet,1,diet,1,diet,0.857143,diet,0.857143,diet,1.0,diet,1,diet,0.857143,diet,0.428571
2,2,Female,60,African American,"0-24,999",50+ hrs/week,Excellent,2-year college degree,0.147511,0.593162,0.574159,0.928628,diet,0.0,Phy,0.142857,Phy,0.0,Phy,0.571429,Phy,0.0,Phy,0.0,Phy,0.0,Phy,0,Phy,0,Phy,0.0,Phy,0.0,Phy,0.0,Phy,0,Phy,0.0,Phy,0.0
3,0,Female,62,African American,"0-24,999",1-15 hrs/week,Fair,"Some college, but no degree",0.492041,0.586625,0.645084,0.799052,Phy,0.0,Phy,0.0,Phy,0.467,Phy,0.571429,Phy,0.0,Phy,0.0,Phy,0.0,Phy,0,Phy,0,Phy,0.0,Phy,0.0,Phy,0.0,Phy,0,Phy,0.0,Phy,0.0
4,1,Male,40,HIspanic,"0-24,999",50+ hrs/week,Fair,4-year college degree,0.689156,0.647716,0.738081,0.97589,diet,0.0,diet,0.0,diet,0.285714,diet,0.0,diet,0.125,diet,0.428571,diet,0.0,diet,0,diet,0,diet,0.0,diet,0.0,diet,0.0,diet,0,diet,0.0,diet,0.0


To develop a probabilistic model, we first convert the continuous variables into categorical ones, so that we can calculate probabilities over groups of people who fall into the same category.

In our data, the continuous variables are **Age**, **Motivation, Attitude, Intention,** and **Ownership.** I discretize each of them into 4 equal-width buckets. The four scores use the intervals **(0, 0.25], (0.25, 0.5], (0.5, 0.75], (0.75, 1]**, and Age uses **(0, 25], (25, 50], (50, 75], (75, 100].**

For example,

- a person with a motivation of **0.27455** and another person with a motivation of **0.4677888** will belong to the same category, **(0.25, 0.5].**
- Similarly, people of ages **53** and **73** will belong to the same category, **(50, 75].**
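
The bucketing rule described above can be sketched without pandas. This is only an illustration of right-closed, equal-width binning (the behaviour of `pd.cut` with its default `right=True`); the helper name `bin_label` and the variables `score_edges` and `age_edges` are hypothetical and not part of the notebook.

```python
from bisect import bisect_left

def bin_label(value, edges):
    """Return the right-closed interval (lo, hi] that contains value, or None if it falls outside."""
    if value <= edges[0] or value > edges[-1]:
        return None  # the lowest edge itself is excluded, as with right-closed bins
    upper = bisect_left(edges, value)  # index of the first edge >= value
    return (edges[upper - 1], edges[upper])

score_edges = [0, 0.25, 0.5, 0.75, 1]   # Motivation, Attitude, Intention, Ownership
age_edges = [0, 25, 50, 75, 100]        # Age in years

for motivation in (0.27455, 0.4677888):
    print(motivation, "->", bin_label(motivation, score_edges))
for age in (53, 73):
    print(age, "->", bin_label(age, age_edges))
```

Both motivation values land in `(0.25, 0.5)` and both ages in `(50, 75)`. A score of exactly 0 falls outside every bucket under this rule, which is worth keeping in mind if the data contains zeros.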

In [None]:
# The four scores lie in [0, 1] and Age is in years; each is cut into four equal-width bins
score_bins = [0, 0.25, 0.5, 0.75, 1]
for col in ['Motivation', 'Attitude', 'Intention', 'Ownership']:
    df[col] = pd.cut(x=df[col], bins=score_bins).astype('str')
df['Age'] = pd.cut(x=df['Age'], bins=[0, 25, 50, 75, 100]).astype('str')

#### **Here we convert the ER (engagement ratio) columns to numeric for further processing**

In [None]:
# changing the data types of the ERs
for col in df.columns:
    if 'ER' in col:
        df[col] = pd.to_numeric(df[col], errors='coerce')

# **Here we add 4 new columns (features) to the data:**


1.  **avg_diet**: the **average engagement** ratio over the weeks in which **diet** was recommended
2.  **avg_Pyh**: the **average engagement** ratio over the weeks in which **Phy** was recommended
3.  **num_diet**: the number of times **diet** was recommended
4.  **num_Pyh**: the number of times **Phy** was recommended
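
As a small illustration of these four features, the sketch below computes them for participant 0, using the recommendation and ER values shown in the table above. The function name `summarize` and the list `weeks` are hypothetical; the notebook itself stores the results in DataFrame columns and uses NaN rather than `None` when a recommendation type never occurs.

```python
from statistics import mean

# Participant 0 from the table above: one (recommendation, engagement ratio) pair per week
weeks = [("Phy", 0.166667), ("diet", 0.714286), ("diet", 0.428571), ("diet", 0.571429),
         ("diet", 0.75), ("diet", 0.428571), ("diet", 0.857143), ("diet", 1.0),
         ("diet", 1.0), ("diet", 0.285714), ("diet", 0.571429), ("diet", 0.5),
         ("Phy", 0.0), ("Phy", 0.0), ("Phy", 0.0)]

def summarize(weeks):
    """Count and average the engagement ratios separately for each recommendation type."""
    summary = {}
    for kind in ("diet", "Phy"):
        ratios = [er for rec, er in weeks if rec == kind]
        summary[kind] = {"num": len(ratios), "avg": mean(ratios) if ratios else None}
    return summary

summary = summarize(weeks)
print(summary)
```

For this participant the sketch gives 11 diet weeks with an average of about 0.646 and 4 Phy weeks with an average of about 0.042.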


In [None]:
df['avg_diet']=np.nan
df['avg_Pyh']=np.nan
df['num_diet']=0
df['num_Pyh']=0

In [None]:
# This piece of code calculates each participant's average response
# to physical activity and to diet. It is simple, but it adds a lot of information:
# we can capture how people react to diet and to physical activity. If some people
# dislike diet, the new column 'avg_diet' will tell us. Similarly, if people dislike
# physical activity, 'avg_Pyh' will tell us.
for idx in df.iloc[:, 12:43].index:
    PersonData=df.iloc[:, 12:43].loc[idx, :].values.tolist()
    keys = []
    values =[]
    for i, element in enumerate(PersonData):
        if i%2 ==0:
            keys.append(element)
        else:
            values.append(element)
    tuples=list(zip(keys,values))
    Phy=[]
    diet =[]
    for p in tuples:
        if p[0]=='Phy':
             Phy.append(p[1])
        elif p[0]=='diet':
            diet.append(p[1])
    
            
    if len(diet)==0:
        df.loc[idx,'avg_diet']=np.nan
    else:
        avg_diet=sum(diet)/len(diet)
        df.loc[idx,'avg_diet'] = avg_diet
        df.loc[idx,'num_diet'] = len(diet)
        
    if len(Phy)==0:
        df.at[idx, 'avg_Pyh']=np.nan
    else:
        avg_Pyh=sum(Phy)/len(Phy)
        df.at[idx, 'avg_Pyh']=avg_Pyh
        df.loc[idx,'num_Pyh'] = len(Phy)

In [None]:
df

Unnamed: 0,cluster,Gender,Age,Ethnicity,Income,Work hours,Health condition,Education,Motivation,Attitude,Intention,Ownership,1,ER,2,ER.1,3,ER.2,4,ER.3,5,ER.4,6,ER.5,7,ER.6,8,ER.7,9,ER.8,10,ER.9,11,ER.10,12,ER.11,13,ER.12,14,ER.13,15,ER.14,avg_diet,avg_Pyh,num_diet,num_Pyh
0,0,Female,"(25, 50]",Caucasian,"150,000-199,999",50+ hrs/week,Good,4-year college degree,"(0.25, 0.5]","(0.5, 0.75]","(0.5, 0.75]","(0.5, 0.75]",Phy,0.166667,diet,0.714286,diet,0.428571,diet,0.571429,diet,0.75,diet,0.428571,diet,0.857143,diet,1,diet,1,diet,0.285714,diet,0.571429,diet,0.5,Phy,0,Phy,0.0,Phy,0.0,0.646104,0.041667,11,4
1,1,Female,"(25, 50]",Asian,"100,000-149,999",16-35 hrs/week,Excellent,2-year college degree,"(0.25, 0.5]","(0.75, 1.0]","(0.5, 0.75]","(0.75, 1.0]",diet,0.0,diet,0.428571,diet,1.0,diet,0.857143,diet,0.875,diet,0.857143,diet,1.0,diet,1,diet,1,diet,0.857143,diet,0.857143,diet,1.0,diet,1,diet,0.857143,diet,0.428571,0.80119,,15,0
2,2,Female,"(50, 75]",African American,"0-24,999",50+ hrs/week,Excellent,2-year college degree,"(0.0, 0.25]","(0.5, 0.75]","(0.5, 0.75]","(0.75, 1.0]",diet,0.0,Phy,0.142857,Phy,0.0,Phy,0.571429,Phy,0.0,Phy,0.0,Phy,0.0,Phy,0,Phy,0,Phy,0.0,Phy,0.0,Phy,0.0,Phy,0,Phy,0.0,Phy,0.0,0.0,0.05102,1,14
3,0,Female,"(50, 75]",African American,"0-24,999",1-15 hrs/week,Fair,"Some college, but no degree","(0.25, 0.5]","(0.5, 0.75]","(0.5, 0.75]","(0.75, 1.0]",Phy,0.0,Phy,0.0,Phy,0.467,Phy,0.571429,Phy,0.0,Phy,0.0,Phy,0.0,Phy,0,Phy,0,Phy,0.0,Phy,0.0,Phy,0.0,Phy,0,Phy,0.0,Phy,0.0,,0.069229,0,15
4,1,Male,"(25, 50]",HIspanic,"0-24,999",50+ hrs/week,Fair,4-year college degree,"(0.5, 0.75]","(0.5, 0.75]","(0.5, 0.75]","(0.75, 1.0]",diet,0.0,diet,0.0,diet,0.285714,diet,0.0,diet,0.125,diet,0.428571,diet,0.0,diet,0,diet,0,diet,0.0,diet,0.0,diet,0.0,diet,0,diet,0.0,diet,0.0,0.055952,,15,0
5,2,Male,"(25, 50]",Asian,"100,000-149,999",36-50 hrs/week,Good,"Some college, but no degree","(0.5, 0.75]","(0.5, 0.75]","(0.5, 0.75]","(0.5, 0.75]",Phy,0.0,diet,0.0,diet,0.0,diet,0.0,diet,0.0,Phy,0.0,Phy,0.0,Phy,0,Phy,0,Phy,0.0,diet,0.0,diet,0.0,Phy,0,Phy,0.0,Phy,0.0,0.0,0.0,6,9
6,9,Female,"(50, 75]",Indian/ Asian,,,,,,,,,Phy,0.0,diet,0.0,diet,0.0,diet,0.0,diet,0.0,Phy,0.0,Phy,0.0,Phy,0,Phy,0,Phy,0.0,Phy,0.0,Phy,0.0,Phy,0,Phy,0.0,Phy,0.0,0.0,0.0,4,11
7,9,Female,"(25, 50]",African American,,,,,,,,,Phy,0.166667,diet,0.142857,diet,0.285714,diet,0.285714,diet,0.125,Phy,0.142857,Phy,0.0,Phy,0,Phy,0,Phy,0.0,diet,0.0,diet,0.0,Phy,0,Phy,0.0,Phy,0.0,0.139881,0.034392,6,9
8,1,Male,"(75, 100]",Caucasian,"25,000-49,000",36-50 hrs/week,Good,2-year college degree,"(0.0, 0.25]","(0.75, 1.0]","(0.75, 1.0]","(0.75, 1.0]",Phy,0.0,diet,0.0,diet,0.0,diet,0.0,diet,0.0,diet,0.0,diet,0.0,diet,0,diet,0,diet,0.0,diet,0.0,diet,0.0,diet,0,diet,0.0,diet,0.0,0.0,0.0,14,1


### **Here we create a list of the features that hold the information about the participants' characteristics**

In [None]:
characteristics=[	"Gender",	"Age",	"Ethnicity", "Income",	"Work hours",	"Health condition",	"Education","Motivation", "Attitude", "Intention", "Ownership"]

# **Getting the joint probability of Gender, starting from the counts:**



In [None]:
d=df.groupby(['Gender'])[["num_diet", "num_Pyh"]].sum()
d

Gender,num_diet,num_Pyh
Female,37,53
Male,35,10


The table above shows that, in the first fifteen weeks, for which we have data on nine participants, female participants were recommended physical activity more often than diet (53 vs. 37 times), while male participants were recommended diet 3.5 times as often as physical activity (35 vs. 10 times). In other words, during the first fifteen weeks of the experiment,
>* For female participants: the App recommended **diet** 37 times and **Phy** 53 times

>* For male participants: the App recommended **diet** 35 times and **Phy** 10 times.

# **Below we calculate the marginal totals for Gender**

In [None]:
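# DataFrame.append was removed in pandas 2.0, so the Total row is added with pd.concat
total_row = d.sum().rename('Total').to_frame().T
d1 = pd.concat([d, total_row]).rename_axis('Gender').assign(Total=lambda t: t.sum(axis=1))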
d1

Gender,num_diet,num_Pyh,Total
Female,37,53,90
Male,35,10,45
Total,72,63,135


## **We turn the above table into a probability distribution**

In [None]:
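# Add the Total row and column with pd.concat (DataFrame.append was removed in pandas 2.0), then divide by the grand total
total_row = d.sum().rename('Total').to_frame().T
d1 = (pd.concat([d, total_row]).rename_axis('Gender').assign(Total=lambda t: t.sum(axis=1)) / d.sum().sum()).round(3)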
d1

Gender,num_diet,num_Pyh,Total
Female,0.274,0.393,0.667
Male,0.259,0.074,0.333
Total,0.533,0.467,1.0
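
The arithmetic behind this table can be checked by hand from the Gender counts. The sketch below is a plain-Python illustration, not the notebook's code; the names `counts`, `joint`, `marginal_gender`, and `conditional_on_gender` are hypothetical. It also derives the conditional probabilities given gender, which are the quantities to look at when asking how likely each recommendation is for a female or a male participant.

```python
# Recommendation counts per gender, taken from the table of counts above
counts = {"Female": {"diet": 37, "Phy": 53}, "Male": {"diet": 35, "Phy": 10}}

grand_total = sum(sum(row.values()) for row in counts.values())

# Joint probability: P(gender and recommendation) = count / grand total
joint = {g: {rec: round(n / grand_total, 3) for rec, n in row.items()}
         for g, row in counts.items()}

# Marginal probability of each gender: row total / grand total
marginal_gender = {g: round(sum(row.values()) / grand_total, 3)
                   for g, row in counts.items()}

# Conditional probability: P(recommendation | gender) = count / row total
conditional_on_gender = {g: {rec: round(n / sum(row.values()), 3) for rec, n in row.items()}
                         for g, row in counts.items()}

print(joint)
print(marginal_gender)
print(conditional_on_gender)
```

The joint and marginal values match the table (for example 37/135 ≈ 0.274 and 90/135 ≈ 0.667), while the conditional values show that about 59% of recommendations to women were Phy and about 78% of recommendations to men were diet.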


## **The program below gets the probability distribution of every characteristic.**
Below we combine all of the above steps into one program that calculates the probability distribution for each characteristic.

In [None]:
for c in characteristics:
  d_p=df.groupby([c])[["num_diet", "num_Pyh"]].sum()
  # DataFrame.append was removed in pandas 2.0, so the Total row is added with pd.concat
  with_totals=pd.concat([d_p, d_p.sum().rename('Total').to_frame().T]).rename_axis(c)
  conditional=(with_totals.assign(Total=lambda d: d.sum(axis=1))/d_p.sum().sum()).round(3)
  print("")
  print("_________________________________________________")
  print('       The probability distribution of ','\033[1m' +c+ '\033[0m','      ')
  print("=================================================")
  print(conditional)


_________________________________________________
       The probability distribution of  Gender       
        num_diet  num_Pyh  Total
Gender                          
Female     0.274    0.393  0.667
Male       0.259    0.074  0.333
Total      0.533    0.467  1.000

_________________________________________________
       The probability distribution of  Age       
           num_diet  num_Pyh  Total
Age                                
(25, 50]      0.393    0.163  0.556
(50, 75]      0.037    0.296  0.333
(75, 100]     0.104    0.007  0.111
Total         0.533    0.467  1.000

_________________________________________________
       The probability distribution of  Ethnicity       
                  num_diet  num_Pyh  Total
Ethnicity                                 
African American     0.052    0.281  0.333
Asian                0.156    0.067  0.222
Caucasian            0.185    0.037  0.222
HIspanic             0.111    0.000  0.111
Indian/ Asian        0

#### **Below I wrote a program that takes a person's characteristics as a list of variables and returns the App's prediction for that person.**
To illustrate how the program decides whether to recommend diet or physical activity, we will take a person with a defined set of characteristics and follow the program's steps. Here are our example person's characteristics:



In [None]:
characteristics=[	"Gender",	"Age",	"Ethnicity", "Income",	"Work hours",	"Health condition",	"Education","Motivation", "Attitude", "Intention", "Ownership"]
Personcharacteristic =['Female', '(50, 75]','African American','150,000-199,999','16-35 hrs/week','Excellent',
                       '4-year college degree','(0.25, 0.5]','(0.75, 1.0]', '(0.75, 1.0]','(0.5, 0.75]']

**Here we combine the list of characteristic names and the list of this person's values into a dictionary for better readability**

In [None]:
Participant=dict(zip(characteristics, Personcharacteristic))
Participant

{'Age': '(50, 75]',
 'Attitude': '(0.75, 1.0]',
 'Education': '4-year college degree',
 'Ethnicity': 'African American',
 'Gender': 'Female',
 'Health condition': 'Excellent',
 'Income': '150,000-199,999',
 'Intention': '(0.75, 1.0]',
 'Motivation': '(0.25, 0.5]',
 'Ownership': '(0.5, 0.75]',
 'Work hours': '16-35 hrs/week'}

The program first calculates the probabilities for every characteristic. In our example, the participant is **female**, so, based on all prior observations, we calculate the **joint probabilities of:**

1. Being **female** and being recommended **diet**
2. Being **female** and being recommended **physical activity.**

**We repeat the same process for all the other characteristics and store the resulting joint probabilities in two lists:**

- **dietProb** stores, for each characteristic, the joint probability of having that characteristic and being recommended **diet**.
- **PhyProb** stores, for each characteristic, the joint probability of having that characteristic and being recommended **physical activity**.
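
To make the lookup concrete, here is a minimal sketch that builds the two lists for just two characteristics, using the Gender and Age probabilities printed in the tables above. The names `joint_tables`, `participant`, `diet_prob`, and `phy_prob` are hypothetical stand-ins for the notebook's DataFrames and lists.

```python
# Joint probabilities copied from the Gender and Age tables printed above,
# stored as characteristic -> value -> (P(value and diet), P(value and Phy))
joint_tables = {
    "Gender": {"Female": (0.274, 0.393), "Male": (0.259, 0.074)},
    "Age": {"(25, 50]": (0.393, 0.163), "(50, 75]": (0.037, 0.296), "(75, 100]": (0.104, 0.007)},
}

# Two of the example participant's characteristics
participant = {"Gender": "Female", "Age": "(50, 75]"}

# One entry per characteristic, in the same order as the participant's keys
diet_prob = [joint_tables[name][value][0] for name, value in participant.items()]
phy_prob = [joint_tables[name][value][1] for name, value in participant.items()]

print(diet_prob)
print(phy_prob)
```

The full program does the same thing for all eleven characteristics, with the probabilities rounded to two decimals instead of three.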

Here are the resulting lists for our **example participant**:

In [51]:
df_list=[]
for i in Participant.keys():
  df_=df.groupby(i)[["num_diet", "num_Pyh"]].sum()
  # DataFrame.append was removed in pandas 2.0, so the Total row is added with pd.concat
  with_totals=pd.concat([df_, df_.sum().rename('Total').to_frame().T])
  conditional=(with_totals.assign(Total=lambda d: d.sum(axis=1))/df_.sum().sum()).round(2)
  df_list.append(conditional)

dietProb=[]
PhyProb=[]
print("____________________________________________________________________________________________________________________")
print(
                    "{:35.30}".format("Characteristics "),
                    "{:45.40}".format("Probability to be recommended diet",),
                    "{:30.40}".format("Probability to be recommended Phy" ),
                )
print("====================================================================================================================")
for i, (name, value) in enumerate(Participant.items()):
  # joint probability of this characteristic value with each recommendation
  dt=df_list[i]["num_diet"][value]
  dietProb.append(dt)

  ph=df_list[i]["num_Pyh"][value]
  PhyProb.append(ph)

  print(
                    "{:45.30}".format(name+' ---> '+value),
                    "{:45.30}".format(str(dt)),
                    "{:30.30}".format(str(ph)),
                )
  print("___________________________________________________________________________________________________________________")
print("")
print("This is the list of joint probabilities for the person")

____________________________________________________________________________________________________________________
Characteristics                     Probability to be recommended diet            Probability to be recommended Phy
Gender ---> Female                            0.27                                          0.39                          
___________________________________________________________________________________________________________________
Age ---> (50, 75]                             0.04                                          0.3                           
___________________________________________________________________________________________________________________
Ethnicity ---> African America                0.05                                          0.28                          
___________________________________________________________________________________________________________________
Income ---> 150,000-199,999                   0.1  

Finally, the program recommends either diet or physical activity as follows:

1. Calculate **P(diet)**

> - **P(diet) =** P(Female **and** diet) × P(Age **and** diet) × ... × P(Ownership **and** diet)

2. Calculate **P(physical activity)**

> - **P(phy) =** P(Female **and** phy) × P(Age **and** phy) × ... × P(Ownership **and** phy)

We then recommend **diet** or **physical activity** based on the following condition:

- **If P(diet) > P(phy), we recommend diet.**
- **Otherwise (including a tie), we recommend physical activity.**
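
A minimal sketch of this rule follows. It uses only the first three joint probabilities printed above for the example participant (Gender, Age, Ethnicity); the remaining eight factors are omitted because the printed output is cut off, so this partial product is not the program's final answer. The names `recommend`, `partial_diet`, and `partial_phy` are hypothetical.

```python
import math

def recommend(diet_probs, phy_probs):
    """Compare the two products; anything other than a strict diet win returns 'Phy'."""
    return "diet" if math.prod(diet_probs) > math.prod(phy_probs) else "Phy"

# Gender, Age and Ethnicity factors for the example participant
partial_diet = [0.27, 0.04, 0.05]
partial_phy = [0.39, 0.3, 0.28]
print(recommend(partial_diet, partial_phy))

# A single zero factor sends the whole product to zero, whatever the other factors are
print(recommend([0.9, 0.0], [0.1, 0.1]))
```

Because the products are multiplied over all characteristics, one characteristic value that was never paired with a recommendation in the data (a joint probability of 0) rules that recommendation out entirely.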

# **Below is the final program function**

In [65]:
def bayesRulePredition(Participant):
  """Return 'diet' or 'Phy' by comparing the products of the joint probabilities."""
  df_list=[]
  for i in Participant.keys():
    df_=df.groupby(i)[["num_diet", "num_Pyh"]].sum()
    # DataFrame.append was removed in pandas 2.0, so the Total row is added with pd.concat
    with_totals=pd.concat([df_, df_.sum().rename('Total').to_frame().T])
    conditional=(with_totals.assign(Total=lambda d: d.sum(axis=1))/df_.sum().sum()).round(2)
    df_list.append(conditional)

  dietProb=[]
  PhyProb=[]
  for i, value in enumerate(Participant.values()):
    dietProb.append(df_list[i]["num_diet"][value])
    PhyProb.append(df_list[i]["num_Pyh"][value])

  # A tie falls through to 'Phy'
  if np.prod(dietProb) > np.prod(PhyProb):
    return 'diet'
  else:
    return "Phy"


# **Let's make a prediction**
Let's use the program to see whether the app recommends diet or physical activity for a given person.

Here are the characteristics of an example person:

In [77]:
exemplePerson_1={'Age': '(50, 75]',
 'Attitude': '(0.75, 1.0]',
 'Education': '4-year college degree',
 'Ethnicity': 'African American',
 'Gender': 'Female',
 'Health condition': 'Excellent',
 'Income': '150,000-199,999',
 'Intention': '(0.75, 1.0]',
 'Motivation': '(0.25, 0.5]',
 'Ownership': '(0.5, 0.75]',
 'Work hours': '16-35 hrs/week'}

In [78]:
bayesRulePredition(exemplePerson_1)

'diet'

**We can see that the program recommends diet for exemplePerson_1**

Let's use the program to make a prediction for another person, whom we call exemplePerson_2.

In [81]:
exemplePerson_2={'Age': '(50, 75]',
 'Attitude': '(0.5, 0.75]',
 'Education': '4-year college degree',
 'Ethnicity': 'African American',
 'Gender': 'Female',
 'Health condition': 'Excellent',
 'Income': '0-24,999',
 'Intention': '(0.75, 1.0]',
 'Motivation': '(0.0, 0.25]',
 'Ownership': '(0.5, 0.75]',
 'Work hours': '1-15 hrs/week'}

In [82]:
bayesRulePredition(exemplePerson_2)

'Phy'

**We can see that the program recommends physical activity (Phy) for exemplePerson_2**