# <b>Hypothesis Testing</b>

This notebook conducts a hypothesis test to determine <b>whether there is a statistically significant difference in video views between verified and unverified accounts.</b>

### <b>Data Exploration</b>

In [2]:
import pandas as pd
import numpy as np
from scipy import stats

In [3]:
# Loading cleaned data set
data = pd.read_csv("TikTokCleaned.csv")

In [4]:
data.head(10)

,Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0
1,1,2,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0
2,2,3,claim,9859838091,31,someone shared with me that american industria...,not verified,active,902185.0,97690.0,2858.0,833.0,329.0
3,3,4,claim,1866847991,25,someone shared with me that the metro of st. p...,not verified,active,437506.0,239954.0,34812.0,1234.0,584.0
4,4,5,claim,7105231098,19,someone shared with me that the number of busi...,not verified,active,56167.0,34987.0,4110.0,547.0,152.0
5,5,6,claim,8972200955,35,someone shared with me that gross domestic pro...,not verified,under review,336647.0,175546.0,62303.0,4293.0,1857.0
6,6,7,claim,4958886992,16,someone shared with me that elvis presley has ...,not verified,active,750345.0,486192.0,193911.0,8616.0,5446.0
7,7,8,claim,2270982263,41,someone shared with me that the best selling s...,not verified,active,547532.0,1072.0,50.0,22.0,11.0
8,8,9,claim,5235769692,50,someone shared with me that about half of the ...,not verified,active,24819.0,10160.0,1050.0,53.0,27.0
9,9,10,claim,4660861094,45,someone shared with me that it would take a 50...,verified,active,931587.0,171051.0,67739.0,4104.0,2540.0


In [5]:
data.describe()

,Unnamed: 0,#,video_id,video_duration_sec,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
count,19084.0,19084.0,19084.0,19084.0,19084.0,19084.0,19084.0,19084.0,19084.0
mean,9541.5,9542.5,5624840000.0,32.423811,254708.558688,84304.63603,16735.248323,1049.429627,349.312146
std,5509.220604,5509.220604,2537030000.0,16.22647,322893.280814,133420.546814,32036.17435,2004.299894,799.638865
min,0.0,1.0,1234959000.0,5.0,20.0,0.0,0.0,0.0,0.0
25%,4770.75,4771.75,3425100000.0,18.0,4942.5,810.75,115.0,7.0,1.0
50%,9541.5,9542.5,5609500000.0,32.0,9954.5,3403.5,717.0,46.0,9.0
75%,14312.25,14313.25,7840823000.0,47.0,504327.0,125020.0,18222.0,1156.25,292.0
max,19083.0,19084.0,9999873000.0,60.0,999817.0,657830.0,256130.0,14994.0,9599.0


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19084 entries, 0 to 19083
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Unnamed: 0                19084 non-null  int64  
 1   #                         19084 non-null  int64  
 2   claim_status              19084 non-null  object 
 3   video_id                  19084 non-null  int64  
 4   video_duration_sec        19084 non-null  int64  
 5   video_transcription_text  19084 non-null  object 
 6   verified_status           19084 non-null  object 
 7   author_ban_status         19084 non-null  object 
 8   video_view_count          19084 non-null  float64
 9   video_like_count          19084 non-null  float64
 10  video_share_count         19084 non-null  float64
 11  video_download_count      19084 non-null  float64
 12  video_comment_count       19084 non-null  float64
dtypes: float64(5), int64(4), object(4)
memory usage: 1.9+ MB


### <b>Descriptive Statistics</b>

The analysis focuses on two variables, <b>video_view_count</b> and <b>verified_status</b>, so we start by comparing the mean view count for each verification status.

In [32]:
data.groupby('verified_status')['video_view_count'].mean()

verified_status
not verified    265663.785339
verified         91439.164167
Name: video_view_count, dtype: float64

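The mean video view count for unverified accounts (about 265,664) is much higher than for verified accounts (about 91,439). Whether this gap is statistically significant is what the hypothesis test below checks.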
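The `describe()` output above shows an overall median view count (9954.5) far below the overall mean (about 254,709), so the distribution is strongly right-skewed and a mean alone can hide a lot. The sketch below shows one way to put the count, mean, and median per group side by side. It runs on a small made-up DataFrame named `toy` (an assumption for illustration, not the real dataset) so that it is self-contained.

```python
import pandas as pd

# Hypothetical toy data standing in for TikTokCleaned.csv (values are made up).
toy = pd.DataFrame({
    "verified_status": ["verified", "not verified", "not verified",
                        "verified", "not verified"],
    "video_view_count": [1000.0, 500000.0, 2000.0, 3000.0, 4000.0],
})

# Count, mean, and median per group; a median far below the mean signals right skew.
summary = (
    toy.groupby("verified_status")["video_view_count"]
       .agg(["count", "mean", "median"])
)
print(summary)
```

Applied to the real `data` frame, the same `agg` call would also report how many rows fall in each group, which is worth knowing before comparing means.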

### <b>Hypothesis Test</b>

<b>Null Hypothesis: </b>There is no difference in mean video views between verified and unverified accounts (any observed difference is due to sampling variability).<br>
<b>Alternative Hypothesis: </b>There is a difference in mean video views between verified and unverified accounts (the observed difference reflects an actual difference and is not due to chance).
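Stated in symbols, with $\mu_{\text{verified}}$ and $\mu_{\text{not verified}}$ denoting the population mean view counts of the two groups (notation introduced here for illustration, not used elsewhere in the notebook), the two-sided hypotheses read:

$$
H_0:\ \mu_{\text{verified}} = \mu_{\text{not verified}}
\qquad\text{versus}\qquad
H_A:\ \mu_{\text{verified}} \neq \mu_{\text{not verified}}
$$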

The chosen significance level is 5%.

In [35]:
significant_level = 0.05

In [30]:
verified = data[data['verified_status']=='verified']
nonverified = data[data['verified_status']=='not verified']

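Conducting a two-sample t-test to compare the group means. Passing `equal_var=False` runs Welch's t-test, which does not assume the two groups have equal variances.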

In [33]:
# ttest_ind returns the full result (statistic, p-value, df), not just the p-value
t_result = stats.ttest_ind(a=verified['video_view_count'], b=nonverified['video_view_count'], equal_var=False)

In [34]:
t_result

TtestResult(statistic=-25.499441780633777, pvalue=2.6088823687177823e-120, df=1571.163074387424)
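To make the statistic and the fractional degrees of freedom in the output above less opaque, here is a minimal sketch of Welch's t-test computed by hand and checked against `scipy.stats.ttest_ind`. The helper `welch_t` and the synthetic samples `a` and `b` are assumptions introduced for illustration; they are not part of the original notebook and do not reproduce its numbers.

```python
import numpy as np
from scipy import stats

# Synthetic samples (assumption): stand-ins for the two view-count columns.
rng = np.random.default_rng(0)
a = rng.normal(loc=100.0, scale=20.0, size=200)
b = rng.normal(loc=110.0, scale=40.0, size=500)


def welch_t(x, y):
    """Return (t statistic, degrees of freedom, two-sided p-value) for Welch's t-test."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Variance of each sample mean, based on the unbiased sample variance.
    vx = x.var(ddof=1) / x.size
    vy = y.var(ddof=1) / y.size
    t = (x.mean() - y.mean()) / np.sqrt(vx + vy)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (vx + vy) ** 2 / (vx ** 2 / (x.size - 1) + vy ** 2 / (y.size - 1))
    # Two-sided p-value from the t distribution's survival function.
    p = 2 * stats.t.sf(abs(t), df)
    return t, df, p


t_manual, df_manual, p_manual = welch_t(a, b)
reference = stats.ttest_ind(a, b, equal_var=False)
print(t_manual, df_manual, p_manual)
print(reference.statistic, reference.pvalue)
```

Because the first argument plays the role of the verified group, a negative statistic means the first group's sample mean is the lower one, which matches the sign of the result above.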

The p-value of about 2.61e-120 is far smaller than the significance level of 0.05 (5%). <br> <b>Hence we reject the null hypothesis.</b>
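The decision rule can be written out explicitly. The sketch below copies the p-value from the `TtestResult` output earlier in this notebook and compares it with the 5% level; the helper `decide` and the names `p_val` and `alpha` are hypothetical, introduced only for this example.

```python
# p-value copied from the TtestResult output shown earlier in the notebook.
p_val = 2.6088823687177823e-120
alpha = 0.05  # the 5% significance level chosen above


def decide(p, alpha):
    """Hypothetical helper: state the test decision for a p-value and a level."""
    return "reject H0" if p < alpha else "fail to reject H0"


decision = decide(p_val, alpha)
print(decision)  # prints "reject H0"
```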

<b>There is a statistically significant difference in mean video views between verified and unverified accounts, with unverified accounts averaging more views. The difference observed in the descriptive statistics is very unlikely to be due to chance alone.</b>