# Feature Engineering - 5

Q1. Pearson correlation coefficient is a measure of the linear relationship between two variables. Suppose
you have collected data on the amount of time students spend studying for an exam and their final exam
scores. Calculate the Pearson correlation coefficient between these two variables and interpret the result.

Ans. A solution using simulated sample data is as follows:

In [5]:
import pandas as pd
import numpy as np

# Set a random seed for reproducibility
np.random.seed(42)

# Generate sample data 
num_students = 50
study_hours = np.random.randint(0, 10, num_students)  # Simulated study hours (0 to 9)
exam_scores = study_hours * 10 + np.random.randint(-5, 6, num_students)  # Simulated exam scores with noise

# Create a DataFrame
data = {
    "Study Hours": study_hours,
    "Exam Score": exam_scores
}

df = pd.DataFrame(data)

df.head()


| | Study Hours | Exam Score |
| --- | --- | --- |
| 0 | 6 | 57 |
| 1 | 3 | 25 |
| 2 | 7 | 68 |
| 3 | 4 | 36 |
| 4 | 6 | 62 |


In [6]:
df.corr(method='pearson')

| | Study Hours | Exam Score |
| --- | --- | --- |
| Study Hours | 1.0 | 0.993047 |
| Exam Score | 0.993047 | 1.0 |


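**Result :** The Pearson correlation between study hours and exam scores is about 0.993, a very strong positive linear relationship: as the time spent studying increases, the final exam score increases as well. The value is this high because the simulated scores were generated as ten times the study hours plus a small amount of noise.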
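As a cross-check on what `df.corr(method='pearson')` computes, the sketch below evaluates the Pearson definition directly: the sum of products of deviations from the mean, divided by the square root of the product of the two sums of squared deviations. The helper name `pearson_r` and the freshly generated sample data are assumptions made for illustration; they are not part of the original notebook, so the printed value will not match 0.993047 exactly.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation computed from its definition (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


# Illustrative data built with the same recipe as above, but with a different
# random generator, so the numbers differ from the notebook's DataFrame.
rng = np.random.default_rng(42)
hours = rng.integers(0, 10, 50)
scores = hours * 10 + rng.integers(-5, 6, 50)

r_manual = pearson_r(hours, scores)
r_numpy = float(np.corrcoef(hours, scores)[0, 1])
print(r_manual, r_numpy)  # the two values should agree up to rounding error
```

Agreement between the two numbers confirms that the coefficient depends only on how the two variables deviate from their means together, not on their units.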

Q2. Spearman's rank correlation is a measure of the monotonic relationship between two variables.
Suppose you have collected data on the amount of sleep individuals get each night and their overall job
satisfaction level on a scale of 1 to 10. Calculate the Spearman's rank correlation between these two
variables and interpret the result.

Ans. A solution using simulated sample data is as follows:

In [5]:
import pandas as pd
import numpy as np

np.random.seed(23)

sleep_hours = np.random.randint(2,9,50)
job_satisfaction = sleep_hours*0.9 + np.random.uniform(0,2,50)
job_satisfaction = np.clip(job_satisfaction, 1, 10).astype(int)

data = {'sleep_hours' : sleep_hours,
       'job_satisfaction' : job_satisfaction}


df = pd.DataFrame(data)
df.head()


| | sleep_hours | job_satisfaction |
| --- | --- | --- |
| 0 | 5 | 6 |
| 1 | 8 | 8 |
| 2 | 2 | 2 |
| 3 | 3 | 4 |
| 4 | 8 | 7 |


In [6]:
df.corr(method='spearman')

| | sleep_hours | job_satisfaction |
| --- | --- | --- |
| sleep_hours | 1.0 | 0.921385 |
| job_satisfaction | 0.921385 | 1.0 |


**Result :** The Spearman rank correlation between hours of sleep and job satisfaction is about 0.921, a strong positive monotonic relationship: individuals who sleep more tend to report higher job satisfaction. This is plausible, since a good night's sleep tends to improve mood, although here it mainly reflects how the sample data was generated.
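Spearman's coefficient can be read as the Pearson coefficient applied to ranks instead of raw values. The sketch below illustrates this under stated assumptions: the DataFrame name `demo` and the regenerated sample data are hypothetical, and ties are given average ranks, which is the usual convention.

```python
import numpy as np
import pandas as pd

# Illustrative data following the same recipe as above; a different random
# generator is used, so the values differ from the notebook's DataFrame.
rng = np.random.default_rng(23)
sleep_hours = rng.integers(2, 9, 50)
job_satisfaction = np.clip(sleep_hours * 0.9 + rng.uniform(0, 2, 50), 1, 10).astype(int)
demo = pd.DataFrame({"sleep_hours": sleep_hours, "job_satisfaction": job_satisfaction})

# Built-in Spearman correlation
rho_builtin = demo.corr(method="spearman").loc["sleep_hours", "job_satisfaction"]

# Same quantity by hand: rank each column (ties get the average rank),
# then take the ordinary Pearson correlation of the ranks
ranks = demo.rank(method="average")
rho_from_ranks = ranks.corr(method="pearson").loc["sleep_hours", "job_satisfaction"]

print(rho_builtin, rho_from_ranks)  # should agree up to rounding error
```

Because only the ordering of the values matters, Spearman's coefficient suits an ordinal scale such as a 1 to 10 satisfaction rating.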

Q3. Suppose you are conducting a study to examine the relationship between the number of hours of
exercise per week and body mass index (BMI) in a sample of adults. You collected data on both variables
for 50 participants. Calculate the Pearson correlation coefficient and the Spearman's rank correlation
between these two variables and compare the results.

Ans. A solution using an exercise dataset from Kaggle is as follows. The `Duration` column is taken as the measure of exercise:

In [22]:
import pandas as pd
import numpy as np

# Load the exercise dataset downloaded from Kaggle
df = pd.read_csv('exercise_dataset.csv')

df.head()

| | ID | Exercise | Calories Burn | Dream Weight | Actual Weight | Age | Gender | Duration | Heart Rate | BMI | Weather Conditions | Exercise Intensity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | Exercise 2 | 286.959851 | 91.892531 | 96.301115 | 45 | Male | 37 | 170 | 29.426275 | Rainy | 5 |
| 1 | 2 | Exercise 7 | 343.453036 | 64.165097 | 61.104668 | 25 | Male | 43 | 142 | 21.286346 | Rainy | 5 |
| 2 | 3 | Exercise 4 | 261.223465 | 70.846224 | 71.766724 | 20 | Male | 20 | 148 | 27.899592 | Cloudy | 4 |
| 3 | 4 | Exercise 5 | 127.183858 | 79.477008 | 82.984456 | 33 | Male | 39 | 170 | 33.729552 | Sunny | 10 |
| 4 | 5 | Exercise 10 | 416.318374 | 89.960226 | 85.643174 | 29 | Female | 34 | 118 | 23.286113 | Cloudy | 3 |


In [23]:
# Keep only the two columns of interest and limit the data to the first 50 rows
df = df[['Duration', 'BMI']].head(50)
df.head()

| | Duration | BMI |
| --- | --- | --- |
| 0 | 37 | 29.426275 |
| 1 | 43 | 21.286346 |
| 2 | 20 | 27.899592 |
| 3 | 39 | 33.729552 |
| 4 | 34 | 23.286113 |


In [24]:
df.corr(method='pearson')

| | Duration | BMI |
| --- | --- | --- |
| Duration | 1.0 | 0.138703 |
| BMI | 0.138703 | 1.0 |


In [25]:
df.corr(method='spearman')

| | Duration | BMI |
| --- | --- | --- |
| Duration | 1.0 | 0.151927 |
| BMI | 0.151927 | 1.0 |


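**Result :** Both coefficients are small and positive (Pearson ≈ 0.139, Spearman ≈ 0.152), so in these 50 rows there is at most a weak relationship between exercise duration and BMI. The two measures agree closely, which suggests there is no strong monotonic but non-linear pattern that Pearson alone would miss.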
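The two coefficients do not always agree as closely as they do here. The sketch below uses a made-up exponential curve, not the Kaggle data, to show a case where the relationship is perfectly monotonic but not linear; the DataFrame name `curve` and its columns are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic, strictly increasing but strongly curved relationship
x = np.linspace(0, 5, 50)
y = np.exp(x)
curve = pd.DataFrame({"x": x, "y": y})

pearson = curve.corr(method="pearson").loc["x", "y"]
spearman = curve.corr(method="spearman").loc["x", "y"]

# Spearman sees a perfect monotonic relationship (the ranks line up exactly),
# while Pearson is pulled below 1 because the points do not lie on a line.
print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

A large gap between the two values is a hint to look at a scatter plot before describing the relationship as linear.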

Q4. A researcher is interested in examining the relationship between the number of hours individuals
spend watching television per day and their level of physical activity. The researcher collected data on
both variables from a sample of 50 participants. Calculate the Pearson correlation coefficient between
these two variables.

Ans. A solution using simulated sample data is as follows:

In [28]:
import pandas as pd
import numpy as np

# Set a random seed for reproducibility
np.random.seed(42)

# Generate sample data
num_participants = 50
tv_hours = np.random.uniform(0, 5, num_participants)  # Simulated TV hours (0 to 5 hours)
physical_activity = 10 - tv_hours + np.random.normal(0, 2, num_participants)  #physical activity level with noise

# Create a DataFrame
data = {
    "TV Hours": tv_hours,
    "Physical Activity Level": physical_activity
}

df = pd.DataFrame(data)

df.head()


| | TV Hours | Physical Activity Level |
| --- | --- | --- |
| 0 | 1.872701 | 9.604233 |
| 1 | 4.753572 | 5.589165 |
| 2 | 3.65997 | 6.108734 |
| 3 | 2.993292 | 6.4045 |
| 4 | 0.780093 | 6.262863 |


In [29]:
df.corr(method='pearson')

| | TV Hours | Physical Activity Level |
| --- | --- | --- |
| TV Hours | 1.0 | -0.651379 |
| Physical Activity Level | -0.651379 | 1.0 |


**Result :** The correlation coefficient is about -0.651, a moderate negative correlation: as the number of hours spent watching TV increases, the physical activity level tends to decrease, and vice versa.
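The coefficient is well short of -1 even though the data was generated from an exactly linear rule, because the added noise dilutes the relationship. For the recipe used above (TV hours uniform on 0 to 5, noise with standard deviation 2), the expected correlation is -sqrt(Var(tv) / (Var(tv) + Var(noise))). The sketch below works this out and compares it with a much larger simulated sample; the sample size and variable names are assumptions chosen for illustration.

```python
import numpy as np

# Variances implied by the data-generating recipe
var_tv = 5 ** 2 / 12   # variance of a uniform variable on [0, 5]
var_noise = 2 ** 2     # variance of the normal noise (standard deviation 2)

# Expected correlation between tv and (10 - tv + noise)
r_theory = -np.sqrt(var_tv / (var_tv + var_noise))

# Large simulated sample to check the formula (n is an arbitrary choice)
rng = np.random.default_rng(0)
n = 200_000
tv = rng.uniform(0, 5, n)
activity = 10 - tv + rng.normal(0, 2, n)
r_large = np.corrcoef(tv, activity)[0, 1]

print(round(float(r_theory), 3))  # -0.585
print(round(float(r_large), 3))
```

The value of -0.651 obtained from 50 participants sits close to this expected value; a gap of that size is plausible from sampling variation alone with so few points.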

Q5. A survey was conducted to examine the relationship between age and preference for a particular
brand of soft drink. The survey results are shown below:

|Age(Years) | Soft drink Preference|
| --- | -- |
|25 |Coke|
|42 |Pepsi|
|37 |Mountain dew|
|19 |Coke|
|31 |Pepsi|
|28 |Coke|

Ans. Soft drink preference is a categorical feature, so we cannot calculate its correlation with age directly. We first apply one-hot encoding and then calculate the correlation.

Solution using Python is as follows:

In [36]:
import pandas as pd

data = {'Age':[25,42,37,19,31,28],
       'Sof_Drink_Preference':['Coke','Pepsi','Mountain_dew','Coke','Pepsi','Coke']}

df=pd.DataFrame(data)
df

| | Age | Sof_Drink_Preference |
| --- | --- | --- |
| 0 | 25 | Coke |
| 1 | 42 | Pepsi |
| 2 | 37 | Mountain_dew |
| 3 | 19 | Coke |
| 4 | 31 | Pepsi |
| 5 | 28 | Coke |


In [39]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
ohe_enc = ohe.fit_transform(df[['Sof_Drink_Preference']])
ohe_enc_df = pd.DataFrame(ohe_enc.toarray(),columns=ohe.get_feature_names_out())
print('One Hot Encoding output : ')
df=pd.concat([df,ohe_enc_df], axis=1)
df

One Hot Encoding output : 


| | Age | Sof_Drink_Preference | Sof_Drink_Preference_Coke | Sof_Drink_Preference_Mountain_dew | Sof_Drink_Preference_Pepsi |
| --- | --- | --- | --- | --- | --- |
| 0 | 25 | Coke | 1.0 | 0.0 | 0.0 |
| 1 | 42 | Pepsi | 0.0 | 0.0 | 1.0 |
| 2 | 37 | Mountain_dew | 0.0 | 1.0 | 0.0 |
| 3 | 19 | Coke | 1.0 | 0.0 | 0.0 |
| 4 | 31 | Pepsi | 0.0 | 0.0 | 1.0 |
| 5 | 28 | Coke | 1.0 | 0.0 | 0.0 |


In [41]:
df.corr(method='pearson',numeric_only=True)

| | Age | Sof_Drink_Preference_Coke | Sof_Drink_Preference_Mountain_dew | Sof_Drink_Preference_Pepsi |
| --- | --- | --- | --- | --- |
| Age | 1.0 | -0.83724 | 0.394132 | 0.576439 |
| Sof_Drink_Preference_Coke | -0.83724 | 1.0 | -0.447214 | -0.707107 |
| Sof_Drink_Preference_Mountain_dew | 0.394132 | -0.447214 | 1.0 | -0.316228 |
| Sof_Drink_Preference_Pepsi | 0.576439 | -0.707107 | -0.316228 | 1.0 |


**Result :** From the Pearson correlation matrix, we can interpret the following:

- As age increases, preference for Coke decreases (r ≈ -0.837).
- As age increases, preference for Mountain Dew (r ≈ 0.394) and Pepsi (r ≈ 0.576) increases, more strongly for Pepsi.
- The correlation between the Pepsi and Coke columns is negative (r ≈ -0.707). This follows from the encoding itself: each respondent names exactly one drink, so a 1 in one column forces a 0 in the others.
- With only six respondents, these values are indicative at best.
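The sign of each correlation with age can be cross-checked against the mean age per drink. The sketch below rebuilds the survey table, computes those means, and recomputes the age correlations with `pd.get_dummies` as an alternative to `OneHotEncoder`. The names `survey`, `Drink`, `mean_age` and `r_age` are chosen for illustration and are not part of the original notebook.

```python
import pandas as pd

# Survey data from the question (column names chosen for this sketch)
survey = pd.DataFrame({
    "Age": [25, 42, 37, 19, 31, 28],
    "Drink": ["Coke", "Pepsi", "Mountain_dew", "Coke", "Pepsi", "Coke"],
})

# Mean age of the respondents who chose each drink
mean_age = survey.groupby("Drink")["Age"].mean()

# One indicator column per drink, then the Pearson correlation of each with age
dummies = pd.get_dummies(survey["Drink"], dtype=float)
r_age = pd.concat([survey["Age"], dummies], axis=1).corr()["Age"].drop("Age")

print(mean_age)          # Coke 24.0, Mountain_dew 37.0, Pepsi 36.5
print(r_age.round(6))
```

Coke drinkers have the lowest mean age, which matches the negative correlation between age and the Coke indicator; the other two drinks have higher mean ages and positive correlations.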


Q6. A company is interested in examining the relationship between the number of sales calls made per day
and the number of sales made per week. The company collected data on both variables from a sample of
30 sales representatives. Calculate the Pearson correlation coefficient between these two variables.

Ans. A solution using simulated sample data is as follows:

In [33]:
import pandas as pd
import numpy as np

# Set a random seed for reproducibility
np.random.seed(42)

# Generate sample data
num_samples = 30
sales_calls_per_day = np.random.randint(10, 50, num_samples)  # Simulated sales calls per day (10 to 49 calls)
sales_per_week = (sales_calls_per_day * 0.6 + np.random.normal(0, 5, num_samples)).astype(int)  # sales per week 

# Create a DataFrame
data = {
    "Sales Calls per Day": sales_calls_per_day,
    "Sales per Week": sales_per_week
        }

df=pd.DataFrame(data)
df.head()


| | Sales Calls per Day | Sales per Week |
| --- | --- | --- |
| 0 | 48 | 36 |
| 1 | 38 | 21 |
| 2 | 24 | 14 |
| 3 | 17 | 3 |
| 4 | 30 | 15 |


In [34]:
df.corr(method='pearson')

| | Sales Calls per Day | Sales per Week |
| --- | --- | --- |
| Sales Calls per Day | 1.0 | 0.849273 |
| Sales per Week | 0.849273 | 1.0 |


**Result :** The correlation is about 0.849, a strong positive correlation: representatives who make more sales calls per day tend to make more sales per week.
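The correlation coefficient measures how tightly the points follow a line, not how steep that line is. For a simple least-squares fit, the slope equals r multiplied by the ratio of the standard deviations. The sketch below illustrates this on data generated with a recipe similar to the one above; the variable names are hypothetical, a different random generator is used, and the integer truncation of the sales figures is omitted for simplicity.

```python
import numpy as np

# Illustrative data: about 0.6 sales per call plus noise (values differ from
# the notebook's DataFrame because a different random generator is used)
rng = np.random.default_rng(42)
calls = rng.integers(10, 50, 30)
sales = calls * 0.6 + rng.normal(0, 5, 30)

# Strength of the linear relationship
r = np.corrcoef(calls, sales)[0, 1]

# Least-squares line: sales ≈ slope * calls + intercept
slope, intercept = np.polyfit(calls, sales, 1)

# The same slope recovered from r and the two sample standard deviations
slope_from_r = r * sales.std(ddof=1) / calls.std(ddof=1)

print(r, slope, slope_from_r)  # the two slope values should agree
```

The slope estimates how many additional weekly sales go with one more call per day, which the correlation alone does not reveal.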