# Q1. Pearson correlation coefficient is a measure of the linear relationship between two variables. Suppose you have collected data on the amount of time students spend studying for an exam and their final exam scores. Calculate the Pearson correlation coefficient between these two variables and interpret the result.

In [6]:
import pandas as pd
data = {'study_time':[10, 8, 12, 14, 7], 'exam_scores':[85, 78, 90, 92, 75]}
df = pd.DataFrame(data)
df

,study_time,exam_scores
0,10,85
1,8,78
2,12,90
3,14,92
4,7,75


In [10]:
df.corr( method = 'pearson')

,study_time,exam_scores
study_time,1.0,0.981551
exam_scores,0.981551,1.0


Interpretation of the Pearson correlation coefficient:

The Pearson correlation coefficient, denoted r, measures the strength and direction of the linear relationship between two variables. It ranges from -1 to 1.

If r is close to 1, there is a strong positive linear relationship: as students spend more time studying, their exam scores tend to increase.

If r is close to -1, there is a strong negative linear relationship: as students spend more time studying, their exam scores tend to decrease.

If r is close to 0, there is little or no linear relationship between the two variables.

Here r ≈ 0.98, so study time and exam scores have a very strong positive linear relationship in this sample: students who studied longer tended to score higher. Two cautions apply. The sample contains only five students, so the estimate is imprecise, and a correlation by itself does not show that extra study time causes the higher scores.
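As a cross-check on `df.corr`, the coefficient can also be computed directly from its definition, r = Σ(x - x̄)(y - ȳ) / sqrt(Σ(x - x̄)² · Σ(y - ȳ)²). The following is a minimal sketch using NumPy; the variable names are illustrative and not part of the original answer. It should agree with the 0.981551 reported above.

```python
import numpy as np

# Same data as in the cell above
study_time = np.array([10, 8, 12, 14, 7], dtype=float)
exam_scores = np.array([85, 78, 90, 92, 75], dtype=float)

# Deviation of each observation from its variable's mean
dx = study_time - study_time.mean()
dy = exam_scores - exam_scores.mean()

# Pearson's r: sum of cross-products divided by the square root of the
# product of the two sums of squared deviations
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print("Pearson r (manual):", round(r_manual, 6))
```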




# Q2. Spearman's rank correlation is a measure of the monotonic relationship between two variables. Suppose you have collected data on the amount of sleep individuals get each night and their overall job satisfaction level on a scale of 1 to 10. Calculate the Spearman's rank correlation between these two variables and interpret the result.

Spearman's rank correlation coefficient (ρ) is a non-parametric measure of the strength and direction of the monotonic relationship between two variables. It is suitable for assessing the relationship between variables when the data may not meet the assumptions of linearity required by Pearson's correlation coefficient.
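To illustrate the difference, here is a small sketch with made-up data (the values are hypothetical, chosen so that y grows exponentially with x). The relationship is perfectly monotonic but not linear, so Spearman's ρ is 1 while Pearson's r is noticeably lower.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical data: y always increases with x, but along a curve, not a line
x = np.arange(1, 11)
y = np.exp(x)

pearson_r, _ = pearsonr(x, y)
spearman_rho, _ = spearmanr(x, y)

print("Pearson r:   ", pearson_r)
print("Spearman rho:", spearman_rho)
```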

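To calculate Spearman's rank correlation by hand, follow these steps:

1. Rank the data for each variable separately, giving tied values the average of their ranks.
2. Calculate the difference $d_i$ between the two ranks for each pair of data points.
3. Square these differences.
4. Sum the squared differences.
5. Use the formula for Spearman's rank correlation, where $n$ is the number of pairs:

$$\rho = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)}$$

This formula is exact only when there are no tied values. With ties, ρ is computed as the Pearson correlation of the ranks, which is what `scipy.stats.spearmanr` does.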
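The steps above can be sketched in code as follows, using the sleep and job-satisfaction values from the cells below and the rank-difference formula ρ = 1 - 6Σd² / (n(n² - 1)). `scipy.stats.rankdata` assigns average ranks to ties. Because two people report 7 hours of sleep, the rank-difference formula is only approximate here: it gives 0.975, slightly different from the 0.9747 that `spearmanr` returns by computing the Pearson correlation of the ranks.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

sleep_hours = [6, 7, 5, 8, 7]
job_satisfaction = [5, 8, 4, 9, 7]

# Step 1: rank each variable (ties receive the average of their ranks)
rank_sleep = rankdata(sleep_hours)
rank_job = rankdata(job_satisfaction)

# Steps 2-4: rank differences, squared and summed
d = rank_sleep - rank_job
sum_d2 = np.sum(d ** 2)

# Step 5: rank-difference formula (exact only when there are no ties)
n = len(sleep_hours)
rho_formula = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Reference value: spearmanr handles ties via the Pearson correlation of ranks
rho_scipy, _ = spearmanr(sleep_hours, job_satisfaction)

print("Formula:", rho_formula)
print("scipy:  ", rho_scipy)
```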

In [13]:
from scipy.stats import spearmanr

# Sample data for sleep hours and job satisfaction
sleep_hours = [6, 7, 5, 8, 7]
job_satisfaction = [5, 8, 4, 9, 7]

# Calculate Spearman's rank correlation coefficient
spearman_corr, _ = spearmanr(sleep_hours, job_satisfaction)

print("Spearman's Rank Correlation Coefficient:", spearman_corr)


Spearman's Rank Correlation Coefficient: 0.9746794344808963


In [17]:
# Alternate method: the same calculation on DataFrame columns
import pandas as pd
from scipy.stats import spearmanr

# Sample data for sleep hours and job satisfaction
data = {'sleep_hours':[6, 7, 5, 8, 7],
        'job_satisfaction':[5, 8, 4, 9, 7]}

df = pd.DataFrame(data)

# Calculate Spearman's rank correlation coefficient
spearman_corr, _ = spearmanr(df['sleep_hours'], df['job_satisfaction'])

print("Spearman's Rank Correlation Coefficient:", spearman_corr)


Spearman's Rank Correlation Coefficient: 0.9746794344808963


Spearman's rank correlation coefficient (ρ) ranges from -1 to 1.

If ρ is close to 1, there is a strong positive monotonic relationship: as individuals get more sleep, their job satisfaction tends to increase.

If ρ is close to -1, there is a strong negative monotonic relationship: as individuals get more sleep, their job satisfaction tends to decrease.

If ρ is close to 0, there is little or no monotonic relationship between the two variables.

Here ρ ≈ 0.97, a very strong positive monotonic relationship: in this sample, the ordering of individuals by hours of sleep nearly matches their ordering by job satisfaction. With only five observations the estimate is imprecise, and it does not show that more sleep causes higher satisfaction.

# Q3. Suppose you are conducting a study to examine the relationship between the number of hours of exercise per week and body mass index (BMI) in a sample of adults. You collected data on both variables for 50 participants. Calculate the Pearson correlation coefficient and the Spearman's rank correlation between these two variables and compare the results.

In [28]:
import pandas as pd

# Sample data for exercise hours per week and BMI
# (13 illustrative observations; the question describes 50 participants)
data = {'exercise_hours':[2, 3, 4, 3, 2, 1, 4, 5, 2, 3, 4, 5, 1],
        'bmi':[30.5, 24.0, 23.3, 28.2, 26.7, 25.7, 24.2, 30.9, 28.3, 27.3, 25.9, 23.7, 29.6]}

df = pd.DataFrame(data)

# Calculate Pearson correlation coefficient
pearson_corr = df['exercise_hours'].corr(df['bmi'])

# Calculate Spearman's rank correlation coefficient
spearman_corr = df['exercise_hours'].corr(df['bmi'], method='spearman')

print("Pearson Correlation Coefficient:", pearson_corr)
print("Spearman's Rank Correlation Coefficient:", spearman_corr)


Pearson Correlation Coefficient: -0.31923163419594924
Spearman's Rank Correlation Coefficient: -0.3333974297349129
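
Comparing the two results: both coefficients are negative and nearly equal (about -0.32 and -0.33). This points to a weak-to-moderate tendency for BMI to be lower among people who exercise more, and the ranks tell much the same story as the raw values. Whether a correlation of this size is distinguishable from zero depends on the sample size. As a sketch, `scipy.stats.pearsonr` and `scipy.stats.spearmanr` return a two-sided p-value alongside each coefficient; with only the 13 observations used in the cell above, both p-values are greater than 0.05.

```python
from scipy.stats import pearsonr, spearmanr

# Same sample data as in the cell above
exercise_hours = [2, 3, 4, 3, 2, 1, 4, 5, 2, 3, 4, 5, 1]
bmi = [30.5, 24.0, 23.3, 28.2, 26.7, 25.7, 24.2, 30.9, 28.3, 27.3, 25.9, 23.7, 29.6]

# Each function returns the coefficient and a two-sided p-value
pearson_r, pearson_p = pearsonr(exercise_hours, bmi)
spearman_rho, spearman_p = spearmanr(exercise_hours, bmi)

print(f"Pearson:  r = {pearson_r:.4f}, p = {pearson_p:.3f}")
print(f"Spearman: rho = {spearman_rho:.4f}, p = {spearman_p:.3f}")
```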


# Q4. A researcher is interested in examining the relationship between the number of hours individuals spend watching television per day and their level of physical activity. The researcher collected data on both variables from a sample of 50 participants. Calculate the Pearson correlation coefficient between these two variables.

In [31]:
# First method: scipy.stats.pearsonr
from scipy.stats import pearsonr

# Sample data for TV hours per day and physical activity level
tv_hours = [2, 3, 4, 3, 2, 1, 4, 5, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 2, 3, 4, 3, 2, 1, 4, 5, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 2, 3, 4, 3, 2]
physical_activity = [3, 2, 4, 3, 2, 4, 2, 1, 3, 4, 2, 1, 4, 3, 2, 4, 1, 4, 3, 2, 3, 1, 2, 4, 3, 2, 1, 4, 2, 3, 4, 2, 1, 3, 4, 2, 1, 4, 3, 2, 4, 1, 3, 2, 4, 2, 1, 3, 4]

# Calculate Pearson correlation coefficient
pearson_corr, _ = pearsonr(tv_hours, physical_activity)

print("Pearson Correlation Coefficient:", pearson_corr)


Pearson Correlation Coefficient: -0.4148152572016614


In [32]:
# Second method: pandas DataFrame.corr
import pandas as pd

# Sample data for TV hours per day and physical activity level
data = {'tv_hours':[2, 3, 4, 3, 2, 1, 4, 5, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 2, 3, 4, 3, 2, 1, 4, 5, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 2, 3, 4, 3, 2],
'physical_activity':[3, 2, 4, 3, 2, 4, 2, 1, 3, 4, 2, 1, 4, 3, 2, 4, 1, 4, 3, 2, 3, 1, 2, 4, 3, 2, 1, 4, 2, 3, 4, 2, 1, 3, 4, 2, 1, 4, 3, 2, 4, 1, 3, 2, 4, 2, 1, 3, 4]}

df = pd.DataFrame(data)
# DataFrame.corr returns the full correlation matrix; the off-diagonal entries are r
pearson_corr = df.corr(method = 'pearson')

print("Pearson Correlation Coefficient:", pearson_corr)


Pearson Correlation Coefficient:                    tv_hours  physical_activity
tv_hours           1.000000          -0.414815
physical_activity -0.414815           1.000000
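
Because `df.corr` returns a matrix, the output above shows four numbers under a label that promises one. A minimal sketch of two ways to obtain the single coefficient from pandas, reusing the same data; the names `r_from_matrix` and `r_direct` are illustrative.

```python
import pandas as pd

# Same sample data as in the cells above
tv_hours = [2, 3, 4, 3, 2, 1, 4, 5, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 2, 3, 4,
            3, 2, 1, 4, 5, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 2, 3, 4, 3, 2]
physical_activity = [3, 2, 4, 3, 2, 4, 2, 1, 3, 4, 2, 1, 4, 3, 2, 4, 1, 4, 3, 2, 3, 1, 2, 4, 3,
                     2, 1, 4, 2, 3, 4, 2, 1, 3, 4, 2, 1, 4, 3, 2, 4, 1, 3, 2, 4, 2, 1, 3, 4]
df = pd.DataFrame({'tv_hours': tv_hours, 'physical_activity': physical_activity})

# Option 1: index the correlation matrix by row and column label
r_from_matrix = df.corr(method='pearson').loc['tv_hours', 'physical_activity']

# Option 2: correlate one column with the other directly (Pearson is the default)
r_direct = df['tv_hours'].corr(df['physical_activity'])

print("Pearson Correlation Coefficient:", r_from_matrix)
```

Either way the value is about -0.41, a moderate negative linear association: in this sample, more television time goes with lower physical activity.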


# Q5. A survey was conducted to examine the relationship between age and preference for a particular brand of soft drink. The survey results are shown below:

# Age(Years)   Soft Drink Preference
# 25           Coke
# 42           Pepsi
# 37           Mountain Dew
# 19           Coke
# 31           Pepsi
# 28           Coke

In [49]:
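import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# NOTE: the fourth age is entered as 9 (the survey table lists 19) and 'Moutain Dew'
# is misspelled; both are left unchanged so the outputs below still match this cell.
data = {'Age(years)': [25, 42, 37, 9, 31, 28],
        'Soft Drink Preference': ['Coke', 'Pepsi', 'Moutain Dew', 'Coke', 'Pepsi', 'Coke']}
df = pd.DataFrame(data)

# One-hot encode the categorical preference into one 0/1 column per brand
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[['Soft Drink Preference']])
encoded_df = pd.DataFrame(encoded.toarray(), columns=encoder.get_feature_names_out())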

In [50]:
encoded_df

,Soft Drink Preference_Coke,Soft Drink Preference_Moutain Dew,Soft Drink Preference_Pepsi
0,1.0,0.0,0.0
1,0.0,0.0,1.0
2,0.0,1.0,0.0
3,1.0,0.0,0.0
4,0.0,0.0,1.0
5,1.0,0.0,0.0


In [51]:
df = pd.concat([df[['Age(years)']],encoded_df],axis = 1)

In [52]:
df

,Age(years),Soft Drink Preference_Coke,Soft Drink Preference_Moutain Dew,Soft Drink Preference_Pepsi
0,25,1.0,0.0,0.0
1,42,0.0,0.0,1.0
2,37,0.0,1.0,0.0
3,9,1.0,0.0,0.0
4,31,0.0,0.0,1.0
5,28,1.0,0.0,0.0


In [54]:
Pearson_Correlation = df.corr(method = 'pearson')
Spearman_Correlation = df.corr(method = 'spearman')
Covariance = df.cov()

In [58]:
print(f"Pearson_Correlation: {Pearson_Correlation}")
print(f"Spearman_Correlation: {Spearman_Correlation}")
print(f"Covariance: {Covariance}")

Pearson_Correlation:                                    Age(years)  ...  Soft Drink Preference_Pepsi
Age(years)                           1.000000  ...                     0.530811
Soft Drink Preference_Coke          -0.766652  ...                    -0.707107
Soft Drink Preference_Moutain Dew    0.357143  ...                    -0.316228
Soft Drink Preference_Pepsi          0.530811  ...                     1.000000

[4 rows x 4 columns]
Spearman_Correlation:                                    Age(years)  ...  Soft Drink Preference_Pepsi
Age(years)                           1.000000  ...                     0.621059
Soft Drink Preference_Coke          -0.878310  ...                    -0.707107
Soft Drink Preference_Moutain Dew    0.392792  ...                    -0.316228
Soft Drink Preference_Pepsi          0.621059  ...                     1.000000

[4 rows x 4 columns]
Covariance:                                    Age(years)  ...  Soft Drink Preference_Pepsi
Age(years)           
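
The printed matrices above are abbreviated with `...`, apparently because pandas limits the displayed width. One way to see every column is `DataFrame.to_string()`. The sketch below is an assumption-laden variant, not the original cell: it swaps `OneHotEncoder` for `pd.get_dummies`, a pandas-only alternative, and reuses the values exactly as entered above so that the numbers match the printed output.

```python
import pandas as pd

# Values exactly as entered in the cell above (including the age of 9 and the
# 'Moutain Dew' spelling) so the results match the printed output
data = {'Age(years)': [25, 42, 37, 9, 31, 28],
        'Soft Drink Preference': ['Coke', 'Pepsi', 'Moutain Dew', 'Coke', 'Pepsi', 'Coke']}
df = pd.DataFrame(data)

# get_dummies creates one 0/1 column per brand; dtype=float keeps them numeric
encoded = pd.get_dummies(df, columns=['Soft Drink Preference'], dtype=float)

pearson_matrix = encoded.corr(method='pearson')

# to_string() renders every row and column without the "..." abbreviation
print(pearson_matrix.to_string())
```

Each entry in the `Age(years)` row is the Pearson correlation between age and a 0/1 indicator for one brand, so it describes whether respondents who chose that brand tend to be older or younger. It is not a single measure of association between age and brand preference as a whole.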

# Q6. A company is interested in examining the relationship between the number of sales calls made per day and the number of sales made per week. The company collected data on both variables from a sample of 30 sales representatives. Calculate the Pearson correlation coefficient between these two variables.

In [59]:
from scipy.stats import pearsonr

# Sample data for sales calls made per day and sales made per week
sales_calls_per_day = [10, 8, 12, 14, 7, 9, 11, 13, 8, 10, 12, 15, 7, 9, 11, 14, 8, 10, 13, 15, 7, 9, 12, 14, 8, 10, 11, 13, 7, 9]
sales_made_per_week = [5, 4, 6, 7, 3, 4, 5, 6, 4, 5, 6, 7, 3, 4, 5, 7, 4, 5, 6, 7, 3, 4, 6, 7, 4, 5, 5, 6, 3, 4]

# Calculate Pearson correlation coefficient
pearson_corr, _ = pearsonr(sales_calls_per_day, sales_made_per_week)

print("Pearson Correlation Coefficient:", pearson_corr)


Pearson Correlation Coefficient: 0.9812430065166519
