In [1]:
import pandas as pd
import scipy.stats as stats
from math import sqrt
from statsmodels.formula.api import ols
import random
import statsmodels.api as sm

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Load the datasets
df = pd.read_csv('../Lab 1/spotify.csv')
df_t = pd.read_csv('../Lab 1/StateTestingDetails.csv')
df_p = pd.read_csv('../Lab 1/population_india_census2011.csv')
df_n = pd.read_csv('../Lab 1/netflix_titles.csv')

### 1. z-tests on Spotify dataset

#### Calculation of mean acousticness value for the population (2018-2020)

In [3]:
print(f'Population parameters')
df1 = df[df['year']!=2020]
df1 = df1[df1['year']>= 2018]
pop_num = df1.shape[0]
pop_mean  = df1['acousticness'].mean()
pop_std = df1['acousticness'].std()

print(f"Size = {pop_num}, Mean = {pop_mean}, Std = {pop_std}")

Population parameters
Size = 4052, Mean = 0.27276311065399717, Std = 0.2831018020164635


#### Calculation of mean acousticness value and standard deviation of random sample (300 songs) from Yr 2020

In [4]:
print(f'Sample statistics: Yr 2020')
samp_num = 300

df1 = df[df['year']==2020]
df1 = df1.sample(random_state=42, n=samp_num)
samp_mean = df1['acousticness'].mean()
samp_std = df1['acousticness'].std()

print(f'Size = {samp_num}, Mean = {samp_mean}, Std: {samp_std}')

Sample statistics: Yr 2020
Size = 300, Mean = 0.224682919, Std: 0.24729447700072327


A one-tailed test is when the critical area of a distribution is one-sided (either greater than or less than a certain value, but not both). If the sample being tested falls into the one-sided critical area, the alternative hypothesis will be accepted..

#### Inference

In [5]:
print('Left tailed z test')
alpha  = 0.05

print("Null Hypothesis H0 : μx = μy")
print('Alternate Hypotheis H1 : μx < μy')
print("Alpha value is :", alpha)

actual_z = abs(stats.norm.ppf(alpha))
hypo_z = (samp_mean - pop_mean)/(pop_std/sqrt(samp_num))

print('Critical value (actual_z):',-actual_z)
print('Test statistic (hypo_z):',hypo_z,'\n')

if hypo_z < -(actual_z):
    print("Reject Null Hypothesis")
else:
    print("Failed to reject Null hypothesis")

Left tailed z test
Null Hypothesis H0 : μx = μy
Alternate Hypotheis H1 : μx < μy
Alpha value is : 0.05
Critical value (actual_z): -1.6448536269514729
Test statistic (hypo_z): -2.9416038396509134 

Reject Null Hypothesis


**Null hypothesis**: Acoustic values in songs of 2020 is similar to the population (2018-2020) values 
  
  H0: μx = μy
  
**Alternative hypothesis**: Acoustic values in songs of 2020 is lesser than the population
  
  H1: μx < μy

As we do have the variance of the population, we employ a left tailed z-test to suggest whether there is a significant reduction in the acoustic effect of songs in the year 2020. As the test statistic is less than critical value (it lies in the 5% of the leftmost region).

**Hence H0 is rejected**

***

#### Calculation of mean song title length for the population (2018-2020)

In [63]:
df['name_len'] = df['name'].str.split().str.len()
df1 = df[df['year']!=2020]
df1 = df1[df1['year']>= 2018]

print(f'Population parameters')
pop_num = df1.shape[0]
pop_mean  = df1['name_len'].mean()
pop_std = df1['name_len'].std()

print(f"Size = {pop_num}, Mean = {pop_mean}, Std = {pop_std}")

Population parameters
Size = 4052, Mean = 3.0772458045409676, Std = 2.137856988061137


#### Calculation of mean song title length of random sample (300 songs) from Yr 2000

In [64]:
print(f'Sample statistics: Yr 2000')
samp_num = 300

df1 = df[df['year']==2020]
df1 = df1.sample(random_state=42, n=samp_num)
samp_mean = df1['name_len'].mean()
samp_std = df1['name_len'].std()

print(f'Size = {samp_num}, Mean = {samp_mean}, Std: {samp_std}')

Sample statistics: Yr 2000
Size = 300, Mean = 3.3366666666666664, Std: 2.3110876174376878


A two-tailed test is when critical area of a distribution is two-sided and tests whether a sample is greater or less than a range of values. If the sample being tested falls into either of the critical areas, the alternative hypothesis is accepted.

#### Inference

In [65]:
print('2 tailed z test')
# critical significance level
alpha  = 0.01

print("Null Hypothesis H0 : μx = μy")
print('Alternate Hypotheis H1 : μx < μy')
print("Alpha value is :", alpha)

actual_z = abs(stats.norm.ppf(alpha/2))
hypo_z = (samp_mean - pop_mean)/(pop_std/sqrt(samp_num))

print('Critical value (actual_z):',actual_z)
print('Test statistic (hypo_z):',hypo_z, '\n')

if hypo_z >= actual_z or hypo_z <=-(actual_z):
    print("Reject Null Hypothesis")
else:
    print("Failed to reject Null hypothesis")

2 tailed z test
Null Hypothesis H0 : μx = μy
Alternate Hypotheis H1 : μx < μy
Alpha value is : 0.01
Critical value (actual_z): 2.575829303548901
Test statistic (hypo_z): 2.1017781650237373 

Failed to reject Null hypothesis


**Null hypothesis**: Legth of title of songs of 2020 is similar to the population 2018-2020 values 
  
  H0: μx = μy
  
**Alternative hypothesis**:  Legth of title of songs of 2020 is different from the population
  
  H1: μx != μy

As we have the variance of the population, we employ a 2 tailed z-test with 0.01 significance level to suggest whether there is a significant difference in the title length of songs in the year 2020. Here we choose alpha = 0.01 critical significance value as the outcome of the results of this test would help artists to decide song names in future as 2020 songs were very popular. It directly affects the artists' earning and audience appeal. As the test statistic is less than negative critical value (it lies in the 2.5% of the leftmost region).

**Hence H0 NOT rejected**
****

### 2. t-tests on Spotify dataset

#### Calculation of mean popularity and standard deviation of songs before 2000 from a sample size of 50

In [6]:
print(f'Sample 1 statistics: Before Yr 2000')
samp1_num = 50
ddof1 = samp1_num - 1

df1 = df[df['year']<2000]
df1 = df1.sample(random_state=42, n=samp1_num)
samp1_mean = df1['popularity'].mean()
samp1_std = df1['popularity'].std()

print(f'Size = {samp1_num}, Mean = {samp1_mean}, Std: {samp1_std}')

Sample 1 statistics: Before Yr 2000
Size = 50, Mean = 25.26, Std: 18.08010520106652


#### Calculation of mean popularity and standard deviation of songs after 2000 from a sample size of 50

In [7]:
print(f'Sample 2 statistics: After Yr 2000')
samp2_num = 50
ddof2 = samp2_num - 1

df1 = df[df['year']>=2000]
df1 = df1.sample(random_state=42, n=samp2_num)
samp2_mean = df1['popularity'].mean()
samp2_std = df1['popularity'].std()

print(f'Size = {samp2_num}, Mean = {samp2_mean}, Std: {samp2_std}')

Sample 2 statistics: After Yr 2000
Size = 50, Mean = 52.42, Std: 15.01983722295557


#### Inference

In [8]:
print('2 independent sample Right-tailed t test')
alpha  = 0.05
ddof = ddof1 + ddof2

print("Null Hypothesis H0 : μx = μy")
print('Alternate Hypotheis H1 : μx != μy')
print("Alpha value is :", alpha)

# pooled std dev (same population)
std_pool = sqrt((ddof1*(samp1_std ** 2) + ddof2*(samp2_std ** 2))/ddof)
print(f'Pooled variance: {std_pool}')
actual_z = abs(stats.t.ppf(alpha, ddof))
hypo_z = (samp1_mean - samp2_mean)/(std_pool*sqrt((1/samp1_num + 1/samp2_num)))

print('Critical value (actual_z):',actual_z)
print('Test statistic (hypo_z):',hypo_z, '\n')

if hypo_z >= actual_z:
    print("Reject Null Hypothesis")
else:
    print("Failed to reject Null hypothesis")

2 independent sample Right-tailed t test
Null Hypothesis H0 : μx = μy
Alternate Hypotheis H1 : μx != μy
Alpha value is : 0.05
16.62055525976365
Critical value (actual_z): 1.6605512170440575
Test statistic (hypo_z): -8.170605486854903 

Failed to reject Null hypothesis


**Null hypothesis**: There is no significant difference in the mean popularity of songs before and after year 2000
  
  H0: μx = μy
  
**Alternative hypothesis**: There is an increase in the mean popularity of songs after year 2000
  
  H1: μx > μy

As the 2 sample are independent and we don't know the variances, we employ a right tailed t-test to suggest whether there is a significant increase in the mean popularity of songs after year 2000 compared to before. Here the population is same, so pooled variance is used and DoF = n1 + n2 - 2. As the test statistic is less than critical value (it lies in the 95% region).

**Hence H0 NOT rejected**

***

### 3. chi-squared test on Netflix dataset

#### Checking if there is any dependence between type of show and target audience (children/adults)

In [69]:
df_n = df_n[df_n['rating'].notna()]
df_n = df_n.reset_index(drop=True)
df_n['aud_cat'] = None

for index, row in df_n.iterrows():
    if row['rating'] in ['TV-14', 'TV-PG', 'R', 'PG-13']:
        df_n['aud_cat'].iloc[index] = 'child'
    else:
        df_n['aud_cat'].iloc[index] = 'adult'
        

In [70]:
contigency_tab = pd.crosstab(df_n['type'], df_n['aud_cat']) 
contigency_tab

aud_cat,adult,child
type,Unnamed: 1_level_1,Unnamed: 2_level_1
Movie,2546,2826
TV Show,1446,962


#### Inference

In [71]:
alpha = 0.05
chi_stat, p, dof, expected = stats.chi2_contingency(contigency_tab)
print(f'Test statistic = {chi_stat}\nP-value = {p}\nDoF = {dof}')
if p < alpha:
    print("Reject Null Hypothesis")
else:
    print("Failed to reject Null hypothesis")

Test statistic = 106.09479267108279
P-value = 7.02910957152232e-25
DoF = 1
Reject Null Hypothesis


**Null hypothesis**: There is no dependency between type of show (TV/movie) and target audience (children/adults).
  
  H0: μx = μy
  
**Alternative hypothesis**: There is dependence between show type and audience targetted
  
  H1: μx != μy

Since the 2 are categorical variables we employ the Chi-squared test for association/independence. As obtained p value is very low (< alpha)

**Hence H0 rejected**

***

### 4. ANOVA on covid dataset

#### 1 way anova test to study if density of a region affects the count of covid positive cases 

In [77]:
df_t['Positive'] = df_t['Positive'] + 1
stateMedian = df_t.groupby('State')[['Positive']].median().reset_index().rename(columns={'Positive':'Median'})
for index,row in df_t.iterrows():
    if pd.isnull(row['Positive']):
        df_t['Positive'][index] = int(stateMedian['Median'][stateMedian['State'] == row['State']])
df_t['Positive'].sort_values()
data=pd.merge(df_t, df_p, on='State')
data['Positive'].sort_values()

# Create density groups
def density(data):
    data['density_grp']=0
    for index,row in data.iterrows():
        status=None
        i=row['Density'].split('/')[0]
        try:
            if (',' in i):
                i=int(i.split(',')[0]+i.split(',')[1])
            elif ('.' in i):
                i=round(float(i))
            else:
                i=int(i)
        except ValueError as err:
            pass
        try:
            if (0<i<=300):
                status = 'Dense1'
            elif (300<i<=600):
                status = 'Dense2'
            elif (600<i<=900):
                status = 'Dense3'
            else:
                status = 'Dense4'
        except ValueError as err:
            pass
        data['density_grp'].iloc[index]=status
    return data

data['Positive'].sort_values()
# Map each state as per its density group
data=density(data)
stateDensity=data[['State','density_grp']].drop_duplicates().sort_values(by='State')

In [78]:
df = pd.DataFrame({'Dense1':random.sample(list(data['Positive'][data['density_grp']=='Dense1']), 10),
                      'Dense2':random.sample(list(data['Positive'][data['density_grp']=='Dense1']), 10),
                      'Dense3':random.sample(list(data['Positive'][data['density_grp']=='Dense1']), 10),
                      'Dense4':random.sample(list(data['Positive'][data['density_grp']=='Dense1']), 10)})

# BoxCox Transformation to bring the data to close to Gaussian Distribution
df['Dense1'],fitted_lambda = stats.boxcox(df['Dense1'])
df['Dense2'],fitted_lambda = stats.boxcox(df['Dense2'])
df['Dense3'],fitted_lambda = stats.boxcox(df['Dense3'])
df['Dense4'],fitted_lambda = stats.boxcox(df['Dense4'])

#### Inference

In [79]:
F, p = stats.f_oneway(df['Dense1'], df['Dense2'], df['Dense3'], df['Dense4'])
# Seeing if the overall model is significant
print('F-Statistic = %.3f, p = %f' % (F, p))

F-Statistic = 7.499, p = 0.000505


In [80]:
# Rearrange DataFrame
newDf = df.stack().to_frame().reset_index().rename(columns={'level_1':'density_grp', 0:'Count'})
del newDf['level_0']
model = ols('Count ~ C(density_grp)', newDf).fit()
print("Model summary")
print(model.summary())
# Create ANOVA table
print("Anova table")
res = sm.stats.anova_lm(model, typ= 2)
res

Model summary
                            OLS Regression Results                            
Dep. Variable:                  Count   R-squared:                       0.385
Model:                            OLS   Adj. R-squared:                  0.333
Method:                 Least Squares   F-statistic:                     7.499
Date:                Mon, 08 Feb 2021   Prob (F-statistic):           0.000505
Time:                        13:37:37   Log-Likelihood:                -130.84
No. Observations:                  40   AIC:                             269.7
Df Residuals:                      36   BIC:                             276.4
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                               coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------
Intercept 

Unnamed: 0,sum_sq,df,F,PR(>F)
C(density_grp),1015.45164,3.0,7.499384,0.000505
Residual,1624.855985,36.0,,


In [81]:
# Critical significance level
alpha = 0.01
if p < alpha:
    print("Reject Null Hypothesis")
else:
    print("Failed to reject Null hypothesis")

Reject Null Hypothesis


**Null hypothesis**: There is no effect of location density on covid cases
  
  H0: μx = μy
  
**Alternative hypothesis**: Population location density affects count of covid cases
  
  H1: μx != μy

So the F-statistic and p-values indicate that there is an overall significant effect of density_groups on positive corona cases. 
Here we choose alpha = 0.01 critical significance value as the outcome of the results of this test would affect the entire populations' life and way of living

**Hence H0 rejected**