## Basics of Statistical Analysis


https://www.sfu.ca/~mjbrydon/tutorials/BAinPy/01_intro.html

In [None]:
import pandas as pd
import seaborn as sns
import os

In [None]:
os.getcwd()

In [None]:
data = pd.read_json('C:/Users/MKINA18/Desktop/Advanced Data Analysis for Social Sciences/Datasets/twitter_data_ML.json')
data.columns

In [None]:
data

In [None]:
data.describe()

In [None]:
data = data[['emoji_n', 'tweet_length', 'tweet_unique_length',
       'province_codes', 'female', 'followers_count', 'following_count',
       'age_group', 'university_rank', 'university', 'n_pos_sent', 'rank_dummy', 
       'big_cities', 'avr_w_length', 'positive']]

Two variables were derived beforehand: `avr_w_length` and `positive`.

The code below shows how they were computed.

Do not run it: both variables are already present in the dataset, and the `tweets` column it reads is not among the columns selected above.

def average(numbers):
    if len(numbers)>0:
        return sum(numbers)/len(numbers)
    else:
        return 0

avr_w_length = []

tweets = data['tweets'].to_numpy()

for i in range(len(tweets)):
    words = str(tweets[i]).split()
    lengths = [len(word) for word in words]
    avr_w_length.append(average(lengths))
    
data['avr_w_length'] = pd.DataFrame(avr_w_length)

#------------------------------------------------------------------

data['positive'] = pd.cut(data['n_pos_sent'],[0,3,5],labels=[0,1])
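
One detail of `pd.cut` worth knowing when reading the line above: with its default settings the bins are open on the left, so a value equal to the lowest edge falls outside every bin. The sketch below uses a small made-up series (not the real `n_pos_sent` column, whose range is not stated here) to show the effect and the `include_lowest=True` option.

```python
import pandas as pd

# Made-up scores for illustration only; not taken from the dataset
scores = pd.Series([0, 1, 3, 4, 5])

# Default: bins are (0, 3] and (3, 5], so the value 0 is left unassigned (NaN)
default_bins = pd.cut(scores, [0, 3, 5], labels=[0, 1])

# include_lowest=True closes the first bin on the left: [0, 3] and (3, 5]
with_lowest = pd.cut(scores, [0, 3, 5], labels=[0, 1], include_lowest=True)

print(default_bins.tolist())
print(with_lowest.tolist())
```

If the real column never takes the value 0, both calls give the same result.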

In [None]:
data['female']
#male=0, female=1

### histogram

In [None]:
data['university_rank'].hist()

In [None]:
sns.displot(x='university_rank', row='female', data=data, linewidth=0, kde=True);

### overlaying kernel density plots

In [None]:
sns.kdeplot(x='university_rank', hue='female', data=data, fill=True)  # fill replaces the deprecated shade argument

### t-test

In [None]:
female_rank = data[data['female'] == 1]['university_rank']
male_rank = data[data['female'] == 0]['university_rank']

In [None]:
female_rank

In [None]:
from scipy import stats

In [None]:
# Levene's test for equality of variances; the null hypothesis is that the variances are equal
stats.levene(female_rank, male_rank)

In [None]:
#pip install statsmodels

In [None]:
import statsmodels.stats.api as sms

In [None]:
model = sms.CompareMeans.from_data(data[data['female'] == 1]['university_rank'], data[data['female'] == 0]['university_rank'])
model.summary(usevar='pooled') #pooled or unequal, you have two options

The difference between males and females is not statistically significant at the 0.05 significance level. However, the conclusion might change if we evaluate the output at the 0.1 level.
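
The choice between `usevar='pooled'` and `usevar='unequal'` corresponds to the classic pooled-variance t-test versus Welch's t-test. As a rough sketch of the same comparison with `scipy`, the example below uses synthetic groups (the names `group_a` and `group_b` and their parameters are assumptions, not the real `female_rank` and `male_rank` data).

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the two groups; values are made up for illustration
rng = np.random.default_rng(0)
group_a = rng.normal(loc=500, scale=100, size=200)
group_b = rng.normal(loc=510, scale=150, size=180)

# Pooled-variance t-test: assumes both groups share the same variance
t_pooled, p_pooled = stats.ttest_ind(group_a, group_b, equal_var=True)

# Welch's t-test: does not assume equal variances
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"pooled: t={t_pooled:.3f}, p={p_pooled:.3f}")
print(f"Welch:  t={t_welch:.3f}, p={p_welch:.3f}")
```

A common rule of thumb is to prefer the unequal-variance version when Levene's test rejects equality of variances.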

### Cross-tabs

In [None]:
#age ranges: -18,19-29,30-39,40+

contab_freq = pd.crosstab(
    data['female'],
    data['age_group'],
    margins = False
    #, normalize='index'
   )
contab_freq

In [None]:
chi = stats.chi2_contingency(contab_freq)
chi

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html

In [None]:
pd.DataFrame(chi[3])
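
The result of `chi2_contingency` holds four items: the test statistic, the p-value, the degrees of freedom, and the table of expected frequencies (the item shown with `chi[3]` above). A minimal sketch with a hypothetical 2x4 table of counts, not the real cross-tab, shows how they can be unpacked by name:

```python
import numpy as np
from scipy import stats

# Hypothetical counts: rows are female = 0/1, columns are four age groups
observed = np.array([[30, 120, 80, 40],
                     [25, 150, 60, 20]])

chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)

# dof = (rows - 1) * (columns - 1)
print(f"chi2={chi2_stat:.2f}, p={p_value:.4f}, dof={dof}")

# Expected counts under independence of rows and columns
print(np.round(expected, 1))
```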

### Correlation and scatterplots

In [None]:
import seaborn as sns
sns.scatterplot(x="avr_w_length", y="university_rank", data=data)

In [None]:
ax = sns.scatterplot(x="avr_w_length", y="university_rank", data=data)
ax.set_title("Average word length vs University ranking")
ax.set_xlabel("the average word length used in tweets")

In [None]:
sns.lmplot(x="avr_w_length", y="university_rank", data=data)

In [None]:
sns.lmplot(x="avr_w_length", y="university_rank",hue="big_cities", data=data)

In [None]:
from scipy import stats
stats.pearsonr(data['avr_w_length'], data['university_rank'])
# returns the correlation coefficient and its p-value

In [None]:
data.drop("university", axis=1).corr()

In [None]:
#Seaborn heatmap
sns.heatmap(data.drop("university", axis=1).corr())

In [None]:
sns.heatmap(data.drop("university", axis=1).corr(), cmap="YlGnBu", annot=True, annot_kws={"size": 5})

### Simple regression analysis

In [None]:
data

In [None]:
import statsmodels.api as sm

In [None]:
y = data['university_rank']
X = data['avr_w_length']

In [None]:
X

In [None]:
X = sm.add_constant(X)
# add_constant adds a column of ones so that the model estimates the intercept (Beta zero); sm.OLS does not include one by default.

In [None]:
X

In [None]:
model = sm.OLS(y, X, missing='drop')
model_result = model.fit()
model_result.summary()

***R-squared:***

R-squared (R²) is a measure of how well the independent variables explain the variability of the dependent variable. It ranges from 0 to 1, where 1 indicates a perfect fit. In this case, the R-squared is 0.001, suggesting that the independent variable(s) explain a very small proportion of the variability in the dependent variable.

***Adjusted R-squared:***

Similar to R-squared but adjusts for the number of predictors in the model. It penalizes the addition of irrelevant variables that do not improve the model's explanatory power. In this case, the adjusted R-squared is also very close to 0.
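
To make the two definitions concrete, here is a small sketch that computes R-squared and adjusted R-squared by hand on synthetic data. The variable names and the data-generating process are assumptions for illustration; the numbers are unrelated to the model above.

```python
import numpy as np

# Synthetic data: one predictor plus noise
rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y_demo = 2.0 + 0.5 * x + rng.normal(size=n)

# Least-squares fit with an intercept column
X_demo = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
resid = y_demo - X_demo @ beta

ss_res = np.sum(resid ** 2)                      # unexplained variation
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)   # total variation
k = 1                                            # predictors, excluding the intercept

r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(f"R-squared={r2:.3f}, adjusted R-squared={adj_r2:.3f}")
```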

***F-statistic:***

The F-statistic tests the overall significance of the regression model. A larger F-statistic provides stronger evidence against the null hypothesis that all slope coefficients are zero. Here, the F-statistic is 2.193.

***Prob (F-statistic):***

The p-value associated with the F-statistic. If this p-value is less than a chosen significance level (commonly 0.05), you can reject the null hypothesis that all slope coefficients are equal to zero. In this case, the p-value is 0.139, which is greater than 0.05, so the null hypothesis cannot be rejected: the model as a whole is not statistically significant at the 0.05 level.

***Log-Likelihood:***

The log-likelihood is a measure of how well the model explains the observed data. Higher (less negative) values indicate a better fit, and the value is mainly useful for comparing models fitted to the same data. In this case, the log-likelihood is -25813.

***AIC (Akaike Information Criterion):***

AIC is a measure of the model's goodness of fit, considering the trade-off between the complexity of the model and its fit to the data. Lower AIC values indicate a better model fit. Here, the AIC is 5.163e+04.

***Df Residuals:***

Degrees of freedom of the residuals. It represents the number of observations minus the number of estimated parameters in the model. In this case, Df Residuals is 4102.

***BIC (Bayesian Information Criterion):***

Similar to AIC, BIC is another measure of the goodness of fit that penalizes model complexity. Lower BIC values indicate a better fit. Here, the BIC is 5.164e+04.

***Df Model:***

Degrees of freedom of the model, which is the number of predictors. In this case, Df Model is 1.
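
The AIC and BIC values quoted above can be approximately reproduced from the other numbers in the summary, using the standard definitions AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L). The sketch below assumes the log-likelihood is exactly the rounded value -25813, so the results match the summary only to the displayed precision.

```python
import math

# Values as reported in the summary (the log-likelihood is rounded)
llf = -25813
df_resid = 4102
df_model = 1

k = df_model + 1      # estimated parameters: slope plus intercept
n = df_resid + k      # number of observations

aic = 2 * k - 2 * llf
bic = k * math.log(n) - 2 * llf

print(aic)            # about 5.163e+04
print(round(bic, 1))  # about 5.164e+04
```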

***Covariance Type:***

Specifies the type of covariance matrix used in the analysis. In this case, it's "nonrobust."

***P>|t| (P-value):***

The p-value associated with the t-value. It indicates the probability of observing a t-statistic as extreme as the one computed from the sample data, assuming that the null hypothesis is true.

***[0.025 0.975]:***

The confidence interval for the coefficients. In this case, it provides a 95% confidence interval for the intercept and avr_w_length.
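
As a rough illustration of how the t-values, p-values, and 95% confidence bounds relate to each other, the sketch below computes them by hand for a synthetic regression. The data and names are assumptions; the formulas are the usual non-robust OLS ones.

```python
import numpy as np
from scipy import stats

# Synthetic data for illustration
rng = np.random.default_rng(2)
n = 100
x = rng.normal(size=n)
y_demo = 1.0 + 0.3 * x + rng.normal(size=n)

X_demo = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
resid = y_demo - X_demo @ beta
df_resid = n - X_demo.shape[1]

# Standard errors from the estimated error variance
sigma2 = resid @ resid / df_resid
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X_demo.T @ X_demo)))

# t-values and two-sided p-values
t_values = beta / se
p_values = 2 * stats.t.sf(np.abs(t_values), df_resid)

# 95% confidence interval: estimate +/- critical t * standard error
t_crit = stats.t.ppf(0.975, df_resid)
ci_lower = beta - t_crit * se
ci_upper = beta + t_crit * se

for name, b, lo, hi, p in zip(["const", "x"], beta, ci_lower, ci_upper, p_values):
    print(f"{name}: coef={b:.3f}, 95% CI=[{lo:.3f}, {hi:.3f}], p={p:.4f}")
```

A coefficient is significant at the 0.05 level exactly when its 95% interval excludes zero.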

***Omnibus, Durbin-Watson, Prob(Omnibus), Jarque-Bera, Skew, Kurtosis:***

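These are tests and measures computed on the residuals (errors) of the model. Omnibus and Jarque-Bera test whether the residuals are normally distributed, with Prob(Omnibus) giving the p-value of the Omnibus test. Skew and Kurtosis describe the shape of the residual distribution. Durbin-Watson checks for first-order autocorrelation in the residuals; values near 2 suggest none.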

***Cond. No. (Condition Number):***

A measure of multicollinearity in the model. High condition numbers may indicate multicollinearity among predictor variables.


### Regression diagnostics

#### Histogram of residuals

In [None]:
import seaborn as sns
sns.histplot(model_result.resid)

In [None]:
from scipy import stats
mu, std = stats.norm.fit(model_result.resid)
mu, std

In [None]:
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
# plot the residuals
sns.histplot(x=model_result.resid, ax=ax, stat="density", linewidth=0, kde=True)
ax.set(title="Distribution of residuals", xlabel="residual")

# plot corresponding normal curve
xmin, xmax = plt.xlim() # the x-axis limits of the histogram above
x = np.linspace(xmin, xmax, 100) # generate some x values
p = stats.norm.pdf(x, mu, std) # calculate the y values for the normal curve
sns.lineplot(x=x, y=p, color="orange", ax=ax)
plt.show()

#### Boxplot of residuals

In [None]:
sns.boxplot(x=model_result.resid, showmeans=True)

#### Quantile - quantile plot of residuals

In [None]:
sm.qqplot(model_result.resid, line="s")

In [None]:
fig = sm.qqplot(model_result.resid, line="s")
plt.show()

#### Fit plot

In [None]:
fig = sm.graphics.plot_fit(model_result, 1, vlines=False)
plt.show()

In [None]:
model_result.fittedvalues

In [None]:
Y_max = y.max()
Y_min = y.min()

ax = sns.scatterplot(x=model_result.fittedvalues, y=y)
ax.set(ylim=(Y_min, Y_max))
ax.set(xlim=(Y_min, Y_max)) #revise Y_min and Y_max with 450 and 550
ax.set_xlabel("Predicted value of Univ. Ranking")
ax.set_ylabel("Observed value of Univ. Ranking")

X_ref = Y_ref = np.linspace(Y_min, Y_max, 100)
plt.plot(X_ref, Y_ref, color='red', linewidth=1)
plt.show()

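In the plot above, perfect predictions would fall on the red 45-degree line. The model can be estimated without problems, but with an R-squared of about 0.001, `avr_w_length` is not a great predictor of university rank.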

### Multiple regression models

In [None]:
y = data['university_rank']
X = data[['emoji_n', 'tweet_length', 'tweet_unique_length', 'positive']]
X = sm.add_constant(X)

In [None]:
ks = sm.OLS(y, X)
ks_res =ks.fit()
ks_res.summary()

In [None]:
y = data['university_rank']
X = data[['emoji_n', 'tweet_length', 'tweet_unique_length', 'positive', 'followers_count',
          'following_count', 'age_group', 'female', 'avr_w_length']]
X = sm.add_constant(X)

In [None]:
ks = sm.OLS(y, X)
ks_res =ks.fit()
ks_res.summary()

In [None]:
import statsmodels.formula.api as smf
ksf =  smf.ols('university_rank ~ emoji_n + tweet_length + tweet_unique_length + positive + followers_count + following_count + age_group + female + avr_w_length', data=data)
ksf_res = ksf.fit()
ksf_res.summary()

### Checking for collinearity

In [None]:
import seaborn as sns
sns.pairplot(X[['emoji_n', 'tweet_length', 'tweet_unique_length', 'followers_count',
          'following_count', 'age_group', 'avr_w_length']])

In [None]:
round(data.drop("university", axis=1).corr(),2)

In [None]:
X = X.drop(columns = ['tweet_length', 'followers_count'], inplace = False)

In [None]:
mod1 = sm.OLS(y, X)
mod1_res = mod1.fit()
mod1_res.summary()

### Regression diagnostics again

In [None]:
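from scipy import stats
# distplot is deprecated in recent seaborn versions; histplot with a KDE is its replacement.
# A fitted normal curve can be overlaid as in the earlier residual histogram cell.
sns.histplot(x=mod1_res.resid, stat="density", kde=True)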

In [None]:
sns.boxplot(x=mod1_res.resid, showmeans=True)

In [None]:
fig = sm.qqplot(mod1_res.resid, line='s')
plt.show()

In [None]:
pd.DataFrame({'fit': mod1_res.fittedvalues, 'y':  y})

In [None]:
pd.DataFrame({'fit': mod1_res.fittedvalues, 'y':  y}).corr()

In [None]:
import matplotlib.pyplot as plt
import numpy as np

Y_max = y.max()
Y_min = y.min()

ax = sns.scatterplot(x = mod1_res.fittedvalues, y=y)
ax.set(ylim=(Y_min, Y_max))
ax.set(xlim=(Y_min, Y_max))
ax.set_xlabel("Predicted value of university rank")
ax.set_ylabel("Observed value of university rank")

X_ref = Y_ref = np.linspace(Y_min, Y_max, 100)
plt.plot(X_ref, Y_ref, color='red', linewidth=1)
plt.show()

### Standardize variables

***standardized_a = ( a - a.mean() ) / a.std()***
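
One subtlety when applying this formula: `a.std()` on a pandas Series uses the sample standard deviation (`ddof=1`), while `scipy.stats.zscore` uses the population standard deviation (`ddof=0`) unless told otherwise. The sketch below, on a small made-up series, shows that the two agree only when `ddof` matches.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up values with mean 5 and population standard deviation 2
a = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

z_pandas = (a - a.mean()) / a.std()        # pandas std: ddof=1
z_scipy = stats.zscore(a)                  # scipy default: ddof=0
z_scipy_ddof1 = stats.zscore(a, ddof=1)    # matches the pandas formula

print(np.round(np.asarray(z_scipy), 2))
print(np.round(np.asarray(z_pandas), 2))
```

For large samples the difference is negligible, but it explains small mismatches when checking that the standard deviation is 1.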

In [None]:
from scipy import stats
y_norm = pd.Series(stats.zscore(y), name=y.name)
y_norm

In [None]:
X_norm = X.loc[:, X.columns != "const"]
X_norm = pd.DataFrame(stats.zscore(X_norm))
X_norm = sm.add_constant(X_norm)
X_norm.columns = X.columns
check = pd.concat([round(X_norm.mean(axis=0)), round(X_norm.std(axis=0))], axis=1)
check.columns=["mean", "std dev"]
check

In [None]:
modstd = sm.OLS(y_norm, X_norm)
modstd_res = modstd.fit()
modstd_res.summary()

In [None]:
coeff = mod1_res.params
#coeff = coeff.iloc[(coeff.abs()*-1.0).argsort()] in order to sort coefficients
sns.barplot(x=coeff.values, y=coeff.index, orient='h')

In [None]:
coeff = modstd_res.params
sns.barplot(x=coeff.values, y=coeff.index, orient='h')

### Assumptions of OLS

Ordinary Least Squares (OLS) regression relies on several key assumptions for the validity of its results. Violations of these assumptions may affect the reliability of the estimates. The basic assumptions of OLS are:

***Linearity:***

The relationship between the independent and dependent variables is assumed to be linear. The model assumes that changes in the independent variables have a constant effect on the dependent variable.

***Independence:***

Observations in the dataset should be independent of each other. This means that the value of the dependent variable for one observation should not be influenced by the values of the dependent variable for other observations.

***Homoscedasticity:***

The variance of the residuals (the differences between observed and predicted values) should be constant across all levels of the independent variables. In simpler terms, the spread of residuals should be roughly consistent throughout the range of predicted values.
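
One way to check this assumption numerically is a Breusch-Pagan style test. The sketch below implements the studentized (Koenker) form by hand: regress the squared residuals on the predictors and use n times the R-squared of that regression as a chi-squared statistic. The helper name `breusch_pagan_lm` and the synthetic data are assumptions for illustration.

```python
import numpy as np
from scipy import stats

def breusch_pagan_lm(resid, X):
    """LM statistic n * R^2 from regressing squared residuals on X.

    X must include a constant column. Returns (statistic, p-value).
    """
    e2 = np.asarray(resid, dtype=float) ** 2
    coef, *_ = np.linalg.lstsq(X, e2, rcond=None)
    fitted = X @ coef
    r2 = 1 - np.sum((e2 - fitted) ** 2) / np.sum((e2 - e2.mean()) ** 2)
    lm = len(e2) * r2
    p = stats.chi2.sf(lm, X.shape[1] - 1)   # df = number of non-constant regressors
    return lm, p

rng = np.random.default_rng(6)
n = 1000
x = rng.uniform(1, 5, size=n)
X_demo = np.column_stack([np.ones(n), x])

# Error spread grows with x, so this data is heteroscedastic by construction
y_het = 1 + 2 * x + rng.normal(scale=x, size=n)
beta, *_ = np.linalg.lstsq(X_demo, y_het, rcond=None)
lm_het, p_het = breusch_pagan_lm(y_het - X_demo @ beta, X_demo)

print(f"LM={lm_het:.1f}, p={p_het:.2e}")
```

A small p-value is evidence against constant residual variance.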

***Normality of Residuals:***

The residuals (the differences between observed and predicted values) are assumed to be normally distributed. This assumption is more critical for smaller sample sizes, as larger samples tend to be less sensitive to departures from normality.
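
The Jarque-Bera statistic reported in the regression summary can also be computed directly with `scipy`. The sketch below contrasts normal and clearly skewed synthetic "residuals"; both samples are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
normal_resid = rng.normal(size=1000)               # roughly normal
skewed_resid = rng.exponential(size=1000) - 1.0    # strongly right-skewed

jb_norm, p_norm = stats.jarque_bera(normal_resid)
jb_skew, p_skew = stats.jarque_bera(skewed_resid)

print(f"normal: JB={jb_norm:.2f}, p={p_norm:.3f}")
print(f"skewed: JB={jb_skew:.2f}, p={p_skew:.3g}")
```

The null hypothesis is that the sample comes from a normal distribution, so a small p-value signals a departure from normality.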

***No Perfect Multicollinearity:***

There should not be perfect linear relationships among the independent variables. High multicollinearity (correlation) among independent variables can lead to instability in coefficient estimates.
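
A common numeric check for multicollinearity is the variance inflation factor (VIF): regress each predictor on all the others and compute 1 / (1 - R-squared). The sketch below is a hand-rolled version on synthetic predictors; the helper name `vif` and the data are assumptions, and rules of thumb such as "VIF above 10 is a concern" are conventions rather than strict limits.

```python
import numpy as np

def vif(X):
    """VIF for each column of X (predictors only, without a constant column)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        target = X[:, j]
        # Regress predictor j on a constant plus all other predictors
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(3)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)   # nearly a copy of x1
x3 = rng.normal(size=500)                   # unrelated to the others

vifs = vif(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 1))
```

Here the first two predictors get very large VIFs while the third stays close to 1.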

***No Endogeneity:***

The independent variables are assumed to be exogenous, meaning they are not correlated with the error term. Endogeneity, where an independent variable is correlated with the error term, can bias coefficient estimates.

***No Autocorrelation:***

The residuals are assumed to be independent of each other (no autocorrelation). Autocorrelation occurs when there is a correlation between the residuals at different points in time or space.
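
The Durbin-Watson statistic from the regression summary is the usual first check for this assumption. It is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The sketch below computes it by hand on two synthetic series; the helper name `durbin_watson_stat` is an assumption.

```python
import numpy as np

def durbin_watson_stat(resid):
    """Durbin-Watson statistic: about 2 means no first-order autocorrelation."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(4)
white = rng.normal(size=2000)   # independent residuals

# Positively autocorrelated residuals: each value carries over 80% of the previous one
ar = np.zeros(2000)
for t in range(1, 2000):
    ar[t] = 0.8 * ar[t - 1] + white[t]

dw_white = durbin_watson_stat(white)
dw_ar = durbin_watson_stat(ar)

print(f"independent: {dw_white:.2f}, autocorrelated: {dw_ar:.2f}")
```

Values well below 2 point to positive autocorrelation, and values well above 2 to negative autocorrelation.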

***Additivity:***

The model assumes that the effect of changes in an independent variable on the dependent variable is consistent across all levels of other independent variables.
