# INFO2950 Phase 4: Linear Regression

In this notebook, we will:
1. Regress Instagram metrics on college statistics to see whether there is a linear relationship: specifically, follower count and follower percent increase against size and admission rate.
2. Regress Instagram metrics on the categorical variables that we ignored before: college ownership, region, and locale.

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

In [2]:
data_dir = "../data"

## Merged Dataset

In [3]:
# Load merged dataset
instagram_details = pd.read_csv(os.path.join(data_dir, "instagram_details.csv"))
instagram_details.head()

Unnamed: 0,name,instagram,follower_curr,follower_mean,follower_med,follower_std,follower_min,follower_max,following_curr,following_mean,...,income_med,size,lat,lon,city,ownership,region,state,locale_type,locale_size
0,University of Chicago,uchicago,116800.0,102192.602041,100650.0,9675.450589,85100.0,116800.0,284.0,271.576531,...,47139.0,6600.0,41.787994,-87.599539,Chicago,private non-profit,great lakes,IL,city,large
1,Yale University,yale,490700.0,445869.550173,460900.0,38742.023912,369300.0,490700.0,249.0,244.269896,...,44004.0,5963.0,41.311158,-72.926688,New Haven,private non-profit,new england,CT,city,medium
2,Brown University,brownu,193000.0,185870.992366,187900.0,5583.446131,169000.0,193000.0,141.0,120.185751,...,82670.0,6752.0,41.82617,-71.40385,Providence,private non-profit,new england,RI,city,medium
3,Dartmouth College,dartmouthcollege,65400.0,58784.468193,58100.0,3262.492004,53200.0,65400.0,2278.0,2159.715013,...,68455.0,4312.0,43.704115,-72.289949,Hanover,private non-profit,new england,NH,town,small
4,Haverford College,haverfordedu,9063.0,7931.378109,8007.0,692.013379,6897.0,9063.0,207.0,169.621891,...,65396.0,1305.0,40.007452,-75.305207,Haverford,private non-profit,mid east,PA,suburb,large


In [4]:
# What are the columns again
instagram_details.columns

Index(['name', 'instagram', 'follower_curr', 'follower_mean', 'follower_med',
       'follower_std', 'follower_min', 'follower_max', 'following_curr',
       'following_mean', 'following_med', 'following_std', 'following_min',
       'following_max', 'posts_curr', 'posts_mean', 'posts_med', 'posts_std',
       'posts_min', 'posts_max', 'admission_rate', 'sat_score',
       'cost_attendance', 'income_avg', 'income_med', 'size', 'lat', 'lon',
       'city', 'ownership', 'region', 'state', 'locale_type', 'locale_size'],
      dtype='object')

In [5]:
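# Drop @ivyleague for these analyses (.copy() avoids a SettingWithCopyWarning on the assignment below)
instagram_details = instagram_details[instagram_details.instagram != "ivyleague"].copy()

# Regressing on the log of median followers gives better fits for some predictors
instagram_details['log_follower_med'] = np.log(instagram_details['follower_med'])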

## Multivariate regression on numerical data

In [6]:
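def reg_stats(df, predictors, target):
    """Fit a linear regression of target on predictors and print each coefficient and r^2."""
    reg = LinearRegression().fit(df[predictors], df[target])
    for name, coef in zip(predictors, reg.coef_):
        print("{} coef: {:.6f}".format(name, coef))
    # r^2 is computed on the same rows used for fitting (in-sample)
    print("r^2: {:.3f}".format(reg.score(df[predictors], df[target])))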

In [7]:
reg_stats(instagram_details, ['admission_rate'], 'follower_med')

admission_rate coef: -182343.380366
r^2: 0.042


Predicting the median followers using just the college's admission rate does not give much insight: with $r^2 = 0.042$, admission rate explains only about 4% of the variability.
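As a reminder of what the $r^2$ values in this notebook measure, here is a minimal sketch of the coefficient of determination, $r^2 = 1 - SS_{res} / SS_{tot}$. The data are made up for illustration, and `np.polyfit` stands in for the notebook's `LinearRegression`; the helper name `r_squared` is an assumption, not part of the notebook.

```python
import numpy as np

def r_squared(y_true, y_pred):
    # Residual sum of squares: variation the model fails to explain
    ss_res = np.sum((y_true - y_pred) ** 2)
    # Total sum of squares: variation around the mean of the target
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy data (hypothetical), roughly y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

slope, intercept = np.polyfit(x, y, 1)
r2_fit = r_squared(y, slope * x + intercept)

# Always predicting the mean explains none of the variation
r2_mean = r_squared(y, np.full_like(y, y.mean()))

print("fitted line r^2: {:.3f}".format(r2_fit))
print("mean-only r^2: {:.3f}".format(r2_mean))
```

An $r^2$ near 0 therefore means the fitted line does little better than predicting the mean for every school.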

In [8]:
reg_stats(instagram_details, ['size'], 'follower_med')

size coef: 2.551152
r^2: 0.018


Predicting the median followers using only size is even less useful ($r^2 = 0.018$).

In [9]:
reg_stats(instagram_details, ['admission_rate'], 'log_follower_med')
reg_stats(instagram_details, ['size'], 'log_follower_med')

admission_rate coef: -0.958731
r^2: 0.026
size coef: 0.000066
r^2: 0.273


Using the log of median followers improves $r^2$ drastically for size as a predictor (from 0.018 to 0.273), but not for admission rate (from 0.042 to 0.026).
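One way to read a coefficient when the target is logged: a one-unit increase in the predictor multiplies the predicted target by $e^{\text{coef}}$. The sketch below applies this to the size coefficient printed above (0.000066); the choice of 1000 additional students is a hypothetical step size, and the result describes an association in this single-predictor fit, not a causal effect.

```python
import math

size_coef = 0.000066    # size coefficient from the log_follower_med fit above
extra_students = 1000   # hypothetical difference in enrollment

# On the log scale the effect is additive, so on the original scale it is multiplicative
multiplier = math.exp(size_coef * extra_students)
pct_change = (multiplier - 1) * 100

print("{:.2f}% more median followers per {} extra students".format(pct_change, extra_students))
```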

In [10]:
reg_stats(instagram_details, ['size', 'admission_rate'], 'follower_med')
reg_stats(instagram_details, ['size', 'admission_rate'], 'log_follower_med')

size coef: 5.399882
admission_rate coef: -297206.463067
r^2: 0.107
size coef: 0.000095
admission_rate coef: -2.972966
r^2: 0.470


Using size and admission rate together as predictors, however, raises our $r^2$ to 0.107, or to 0.470 for the log version.

In [11]:
reg_stats(instagram_details, ['size', 'admission_rate', 'income_med'], 'follower_med')
reg_stats(instagram_details, ['size', 'admission_rate', 'income_med'], 'log_follower_med')

size coef: 3.607143
admission_rate coef: -265444.552864
income_med coef: -3.137739
r^2: 0.193
size coef: 0.000089
admission_rate coef: -2.871394
income_med coef: -0.000010
r^2: 0.489


With size, admission rate, and median income as predictors, $r^2$ rises to 0.193, or 0.489 for the log version.

In [12]:
reg_stats(instagram_details, ['size', 'admission_rate', 'income_med', 'cost_attendance'], 'follower_med')
reg_stats(instagram_details, ['size', 'admission_rate', 'income_med', 'cost_attendance'], 'log_follower_med')

size coef: 4.108562
admission_rate coef: -232603.178019
income_med coef: -3.195766
cost_attendance coef: 0.679158
r^2: 0.193
size coef: 0.000095
admission_rate coef: -2.449650
income_med coef: -0.000011
cost_attendance coef: 0.000009
r^2: 0.493


Adding the cost of attendance to our predictors barely changes $r^2$: it stays at 0.193 for the normal version and rises from 0.489 to 0.493 for the log version.
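Because in-sample $r^2$ can never decrease when a predictor is added, a common companion statistic is the adjusted $r^2$, which penalizes the number of predictors. The sketch below is not part of the original analysis; it assumes $n = 69$ schools (the row count that `describe()` reports elsewhere in this notebook) and plugs in the unlogged $r^2$ values printed above.

```python
def adjusted_r2(r2, n, p):
    # n = number of observations, p = number of predictors (intercept not counted)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

n_schools = 69  # assumption: every school is used in the fit

adj_three = adjusted_r2(0.193, n_schools, 3)  # size, admission_rate, income_med
adj_four = adjusted_r2(0.193, n_schools, 4)   # ... plus cost_attendance

print("adjusted r^2 with 3 predictors: {:.3f}".format(adj_three))
print("adjusted r^2 with 4 predictors: {:.3f}".format(adj_four))
```

Under these assumptions the adjusted value drops when cost of attendance is added, which is consistent with the observation that it contributes little.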

In [13]:
def remove_outliers(df, outliers = {"Harvard University", "Stanford University", "Yale University"}):
    return df[~df.name.isin(outliers)]

What if we remove Harvard, Stanford, and Yale as in our earlier exploratory analysis?

In [14]:
filtered_instagram_details = remove_outliers(instagram_details)

reg_stats(filtered_instagram_details, ['size', 'admission_rate', 'income_med', 'cost_attendance'], 'follower_med')
reg_stats(filtered_instagram_details, ['size', 'admission_rate', 'income_med', 'cost_attendance'], 'log_follower_med')

size coef: 4.697836
admission_rate coef: -163826.635340
income_med coef: -0.615451
cost_attendance coef: -0.165052
r^2: 0.469
size coef: 0.000098
admission_rate coef: -2.191500
income_med coef: -0.000002
cost_attendance coef: 0.000006
r^2: 0.558


The same predictors of size, admission rate, median income, and cost of attendance yield a much higher $r^2 = 0.469$ for the normal version and $r^2 = 0.558$ for the log version!
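To illustrate why a handful of extreme accounts can depress $r^2$ so much, here is a small synthetic sketch (made-up numbers, not the Instagram data). It fits a line with `np.polyfit` rather than `LinearRegression`, and the helper `fit_r2` is a hypothetical name.

```python
import numpy as np

def fit_r2(x, y):
    # Fit a straight line and return the in-sample r^2
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

# 20 well-behaved points close to y = 2x
x_clean = np.arange(20, dtype=float)
y_clean = 2.0 * x_clean + np.where(x_clean % 2 == 0, 1.0, -1.0)

# Three extreme points that do not follow the trend
x_all = np.concatenate([x_clean, [0.0, 1.0, 2.0]])
y_all = np.concatenate([y_clean, [500.0, 600.0, 700.0]])

r2_clean = fit_r2(x_clean, y_clean)
r2_all = fit_r2(x_all, y_all)

print("r^2 without extreme points: {:.3f}".format(r2_clean))
print("r^2 with extreme points: {:.3f}".format(r2_all))
```

Least squares penalizes squared residuals, so a few very large values dominate both the fitted line and the total variance.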

What if we try to predict follower percent increase from size, admission rate, median income, and cost of attendance?

In [15]:
instagram_details['follower_pct_increase'] = (instagram_details['follower_max'] - instagram_details['follower_min']) / instagram_details['follower_min'] * 100
filtered_instagram_details = remove_outliers(instagram_details)
instagram_details['follower_pct_increase'].describe()

count     69.000000
mean      28.956461
std       14.540610
min       14.201183
25%       21.788674
50%       26.530612
75%       32.625995
max      116.908213
Name: follower_pct_increase, dtype: float64

Over the past year, the average college Instagram account grew 28.96% from its minimum to its maximum follower count. Among the colleges we have data for, the increase ranged from 14.20% to 116.91% (Johns Hopkins).
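As a worked example of the percent-increase formula, the sketch below recomputes the value for the University of Chicago from the `follower_min` and `follower_max` shown in the `head()` output.

```python
# Values for University of Chicago, taken from the head() output
follower_min = 85100.0
follower_max = 116800.0

# Growth from the minimum to the maximum follower count, in percent
pct_increase = (follower_max - follower_min) / follower_min * 100

print("{:.2f}%".format(pct_increase))
```

Note that this measures the spread between the minimum and maximum over the period, not growth from the first observation to the last.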

In [16]:
reg_stats(instagram_details, ['size', 'admission_rate', 'income_med', 'cost_attendance'], 'follower_pct_increase')

size coef: -0.000147
admission_rate coef: 10.732157
income_med coef: -0.000087
cost_attendance coef: 0.000198
r^2: 0.060


In [17]:
reg_stats(filtered_instagram_details, ['size', 'admission_rate', 'income_med', 'cost_attendance'], 'follower_pct_increase')

size coef: -0.000148
admission_rate coef: 10.554681
income_med coef: -0.000091
cost_attendance coef: 0.000198
r^2: 0.059


The same predictors we used for median followers explain far less of the variation in follower percent increase ($r^2 \approx 0.06$), even with outliers removed.

In [18]:
reg_stats(filtered_instagram_details, ['size', 'admission_rate', 'income_med', 'follower_pct_increase', 'cost_attendance'], 'follower_med')

size coef: 4.699903
admission_rate coef: -163973.871382
income_med coef: -0.614179
follower_pct_increase coef: 13.949833
cost_attendance coef: -0.167821
r^2: 0.469


Adding follower percent increase to our predictors doesn't increase the model's performance on predicting median followers.

## Multivariate regression, adding in categorical data

To do linear regression on college ownership, region, and locale, we'll need to make indicator variables. For ownership this is simple enough (public/private), but for region and locale it's a bit more complicated.

In [19]:
instagram_details['is_private'] = pd.get_dummies(instagram_details['ownership'])['private non-profit']

# Map locale type to a number, in increasing order of "cityness"
locale_type_map = {"rural": 1, "town": 2, "suburb": 3, "city": 4}
instagram_details['locale_type_num'] = instagram_details['locale_type'].map(locale_type_map)

# Map locale size as well (small city, large town, etc)
locale_size_map = {"small": 1, "medium": 2, "large": 3}
instagram_details['locale_size_num'] = instagram_details['locale_size'].map(locale_size_map)

instagram_details.head()

Unnamed: 0,name,instagram,follower_curr,follower_mean,follower_med,follower_std,follower_min,follower_max,following_curr,following_mean,...,ownership,region,state,locale_type,locale_size,log_follower_med,follower_pct_increase,is_private,locale_type_num,locale_size_num
0,University of Chicago,uchicago,116800.0,102192.602041,100650.0,9675.450589,85100.0,116800.0,284.0,271.576531,...,private non-profit,great lakes,IL,city,large,11.519404,37.250294,1,4,3
1,Yale University,yale,490700.0,445869.550173,460900.0,38742.023912,369300.0,490700.0,249.0,244.269896,...,private non-profit,new england,CT,city,medium,13.040936,32.873003,1,4,2
2,Brown University,brownu,193000.0,185870.992366,187900.0,5583.446131,169000.0,193000.0,141.0,120.185751,...,private non-profit,new england,RI,city,medium,12.143665,14.201183,1,4,2
3,Dartmouth College,dartmouthcollege,65400.0,58784.468193,58100.0,3262.492004,53200.0,65400.0,2278.0,2159.715013,...,private non-profit,new england,NH,town,small,10.969921,22.932331,1,2,1
4,Haverford College,haverfordedu,9063.0,7931.378109,8007.0,692.013379,6897.0,9063.0,207.0,169.621891,...,private non-profit,mid east,PA,suburb,large,8.988071,31.404959,1,3,3


In [20]:
# There isn't a natural order for region, so we will predict using each region as a dummy variable
region_dummies = pd.get_dummies(instagram_details['region'])

region_dummies.head()

Unnamed: 0,far west,great lakes,mid east,new england,plains,rocky mountains,southeast,southwest
0,0,1,0,0,0,0,0,0
1,0,0,0,1,0,0,0,0
2,0,0,0,1,0,0,0,0
3,0,0,0,1,0,0,0,0
4,0,0,1,0,0,0,0,0
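One caveat when using a full set of dummy columns: each row has exactly one 1, so the columns sum to a constant and overlap with the intercept that `LinearRegression` fits by default. The model still fits, but the individual region coefficients are not uniquely determined. A minimal sketch with a toy region column (hypothetical rows) shows the issue and one common workaround, `drop_first=True`, which this notebook does not use:

```python
import pandas as pd

# Toy region column for illustration only
regions = pd.Series(["great lakes", "new england", "new england", "mid east"], name="region")

# Full set of indicators: one column per region
full = pd.get_dummies(regions, dtype=int)

# Dropping one level makes the remaining coefficients relative to that baseline region
reduced = pd.get_dummies(regions, dtype=int, drop_first=True)

print(full.sum(axis=1).tolist())   # every row sums to 1
print(list(reduced.columns))
```

With `drop_first=True`, each remaining coefficient reads as the difference from the dropped baseline region, while $r^2$ should be unaffected.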


In [21]:
reg_stats(instagram_details, ['is_private'], 'follower_med')

is_private coef: 13735.530000
r^2: 0.001


In [22]:
# With Harvard, etc. removed
filtered_instagram_details = remove_outliers(instagram_details)
reg_stats(filtered_instagram_details, ['is_private'], 'follower_med')

is_private coef: -36077.989362
r^2: 0.043


Perhaps surprisingly, is_private alone is not a helpful predictor of median followers: $r^2$ is 0.001 with all schools and 0.043 with the outliers removed.

In [23]:
reg_stats(filtered_instagram_details, ['is_private', 'locale_type_num'], 'follower_med')

is_private coef: -18401.738280
locale_type_num coef: 41321.707371
r^2: 0.176


In [24]:
reg_stats(filtered_instagram_details, ['is_private', 'locale_type_num', 'locale_size_num'], 'follower_med')

is_private coef: -17641.207248
locale_type_num coef: 41544.024596
locale_size_num coef: -1683.368362
r^2: 0.177


The fit improves when we add the locale type ($r^2 = 0.176$), but adding the locale size on top of that provides little additional insight ($r^2 = 0.177$).

In [25]:
reg_stats(filtered_instagram_details, ['size', 'admission_rate', 'income_med', 'cost_attendance', 'is_private', 'locale_type_num', 'locale_size_num'], 'follower_med')

size coef: 3.910076
admission_rate coef: -164903.145845
income_med coef: -0.564143
cost_attendance coef: 1.010706
is_private coef: -52188.053258
locale_type_num coef: 21933.083956
locale_size_num coef: -8927.224760
r^2: 0.503


Predicting with everything discussed so far except region manages to explain about 50% of the variability in median number of followers.

In [26]:
reg = LinearRegression().fit(region_dummies, instagram_details['follower_med'])
print("r^2: {}".format(reg.score(region_dummies, instagram_details['follower_med'])))

r^2: 0.05923232019422253


The region dummy variables alone explain only about 6% of the variability in median followers, so region by itself has a small impact.

In [27]:
def all_predict(predictors, df, target):
    """Fit a linear regression on a predictor DataFrame and print each coefficient and r^2."""
    reg = LinearRegression().fit(predictors, df[target])
    for name, coef in zip(predictors.columns, reg.coef_):
        print("{} coef: {}".format(name, coef))
    print("r^2: {:.3f}".format(reg.score(predictors, df[target])))

Now let's use size, admission rate, median income, cost of attendance, is_private, locale type, and locale size, joined with the region dummies, as predictors.

In [28]:
# All schools, predicting follower median
# Join the subset of predictors from instagram_details with the region dummies
predictors = instagram_details[['size', 'admission_rate', 'income_med', 'cost_attendance', 'is_private', 'locale_type_num', 'locale_size_num']].join(region_dummies)

all_predict(predictors, instagram_details, 'follower_med')

size coef: 2.825082683917496
admission_rate coef: -302975.80497739394
income_med coef: -3.2020588147185784
cost_attendance coef: 1.1611096786536215
is_private coef: -67941.42623974949
locale_type_num coef: 56737.365989862774
locale_size_num coef: -20021.179493766333
far west coef: -25278.793394196993
great lakes coef: 36332.19541222751
mid east coef: -15833.714182726415
new england coef: 92319.62083914959
plains coef: 28573.99201383988
rocky mountains coef: 22605.624536122148
southeast coef: -60682.777523019795
southwest coef: -78036.14770139253
r^2: 0.276


In [29]:
# Schools excluding Harvard, Yale, Stanford
predictors = filtered_instagram_details[['size', 'admission_rate', 'income_med', 'cost_attendance', 'is_private', 'locale_type_num', 'locale_size_num']].join(region_dummies)

all_predict(predictors, filtered_instagram_details, 'follower_med')

size coef: 4.397344832826317
admission_rate coef: -217040.55995463178
income_med coef: -0.6521270356744672
cost_attendance coef: -0.8099670197642217
is_private coef: 940.4531781470251
locale_type_num coef: 20292.190140506038
locale_size_num coef: -4196.287797521079
far west coef: -27412.67690903512
great lakes coef: 24388.911847819996
mid east coef: 36044.15164325849
new england coef: 16396.080359230065
plains coef: 796.4727671385137
rocky mountains coef: 23819.647950337974
southeast coef: -17459.326333463006
southwest coef: -56573.261325287334
r^2: 0.595


Wow! Regression on school size, admission rate, median income, cost of attendance, public/private, locale type, locale size, and region gets an $r^2$ of 0.595 once Harvard, Yale, and Stanford are excluded.

In [30]:
# Predicting follower percent increase?

all_predict(predictors, filtered_instagram_details, 'follower_pct_increase')

size coef: -0.0005587527983657784
admission_rate coef: 15.276682739266032
income_med coef: -3.7241008539479e-05
cost_attendance coef: 0.00016257949918339302
is_private coef: -9.154602228579012
locale_type_num coef: 7.371861146438906
locale_size_num coef: 2.564361586436941
far west coef: 0.9298488688484984
great lakes coef: -1.9538056139336657
mid east coef: 7.535494906348923
new england coef: 1.9452749992134775
plains coef: 7.367241064671449
rocky mountains coef: -9.372474938408025
southeast coef: -3.962763144738548
southwest coef: -2.4888161420020753
r^2: 0.182


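Not good: with an $r^2$ of only 0.182, these predictors explain little of the variation in follower percent increase.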

In [31]:
# Predicting median number of accounts following

all_predict(predictors, filtered_instagram_details, 'following_med')

size coef: -0.019804490375923512
admission_rate coef: 4398.628847351812
income_med coef: 0.005329491687019369
cost_attendance coef: 0.1134322026472328
is_private coef: -4595.557968494785
locale_type_num coef: -196.14076956973318
locale_size_num coef: 239.35248903287058
far west coef: -328.47122072200415
great lakes coef: 333.1261903444764
mid east coef: -56.52558871269885
new england coef: 327.6262264841977
plains coef: -380.8827560989822
rocky mountains coef: -1329.9104128769304
southeast coef: 1149.1901239728475
southwest coef: 285.8474376091043
r^2: 0.371


In [32]:
# Predicting current number of posts

all_predict(predictors, filtered_instagram_details, 'posts_curr')

size coef: 0.0013011285141217846
admission_rate coef: 346.08231302885764
income_med coef: -0.006300641433672889
cost_attendance coef: 0.054943217773041245
is_private coef: -2605.235079175117
locale_type_num coef: 119.77899732436097
locale_size_num coef: 188.40571455099823
far west coef: -796.6184230521163
great lakes coef: 245.26850616386935
mid east coef: 317.89006079595396
new england coef: 327.1702262831198
plains coef: -3.3457766415437056
rocky mountains coef: -352.9181104531325
southeast coef: 722.8016197430794
southwest coef: -460.24810283922886
r^2: 0.333


### What about SAT Score?

In [33]:
# Some universities are missing SAT data
instagram_details_sat = instagram_details.dropna(subset=['sat_score'])
filtered_instagram_details_sat = remove_outliers(instagram_details_sat)
filtered_instagram_details_sat.head()

Unnamed: 0,name,instagram,follower_curr,follower_mean,follower_med,follower_std,follower_min,follower_max,following_curr,following_mean,...,ownership,region,state,locale_type,locale_size,log_follower_med,follower_pct_increase,is_private,locale_type_num,locale_size_num
0,University of Chicago,uchicago,116800.0,102192.602041,100650.0,9675.450589,85100.0,116800.0,284.0,271.576531,...,private non-profit,great lakes,IL,city,large,11.519404,37.250294,1,4,3
2,Brown University,brownu,193000.0,185870.992366,187900.0,5583.446131,169000.0,193000.0,141.0,120.185751,...,private non-profit,new england,RI,city,medium,12.143665,14.201183,1,4,2
3,Dartmouth College,dartmouthcollege,65400.0,58784.468193,58100.0,3262.492004,53200.0,65400.0,2278.0,2159.715013,...,private non-profit,new england,NH,town,small,10.969921,22.932331,1,2,1
4,Haverford College,haverfordedu,9063.0,7931.378109,8007.0,692.013379,6897.0,9063.0,207.0,169.621891,...,private non-profit,mid east,PA,suburb,large,8.988071,31.404959,1,3,3
6,University of California-Los Angeles,ucla,278400.0,254695.576623,251000.0,13615.915193,232100.0,278400.0,93.0,88.137662,...,public,far west,CA,city,large,12.433208,19.948298,0,4,3


In [34]:
reg_stats(instagram_details_sat, ['sat_score'], 'follower_med')

sat_score coef: 406.978652
r^2: 0.043


In [35]:
predictors = filtered_instagram_details_sat[['size', 'admission_rate', 'income_med', 'cost_attendance', 'is_private', 'locale_type_num', 'locale_size_num', 'sat_score']].join(region_dummies)

all_predict(predictors, filtered_instagram_details_sat, 'follower_med')
best_reg = LinearRegression().fit(predictors, filtered_instagram_details_sat['follower_med'])

size coef: 4.961921966251231
admission_rate coef: -204369.88470645467
income_med coef: -0.8056774360939859
cost_attendance coef: -2.8219207845259997
is_private coef: 85921.94180040935
locale_type_num coef: 13994.13234707104
locale_size_num coef: -3292.358858703524
sat_score coef: 205.92071233419395
far west coef: -32159.054272267178
great lakes coef: 23354.66579289776
mid east coef: 28881.810367001443
new england coef: 11912.16838397244
plains coef: -11827.531832012135
rocky mountains coef: 75181.19032491097
southeast coef: -28461.039540290945
southwest coef: -66882.2092242125
r^2: 0.609


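Adding SAT score as a predictor raises $r^2$ only slightly, from 0.595 to 0.609. The two fits are also not directly comparable, because schools missing SAT data are dropped here.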

### Is this significant?
Let's permute the target (median followers) and refit the model to see what coefficients arise by chance.

In [36]:
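def permute_rows(df):
    # Shuffle the rows, then reset the index so values no longer line up with their original rows
    permuted = df.sample(frac=1).reset_index(drop=True)
    return permuted

def permute_some_cols(df, cols):
    # Return a copy of df in which each column in cols is shuffled independently
    df = df.copy()
    for col in cols:
        # Assign by position so the shuffle is not undone by index alignment
        df[col] = permute_rows(df[[col]])[col].to_numpy()
    return df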

In [37]:
# Also reset index to make permutation work correctly
permuted_insta = permute_some_cols(filtered_instagram_details_sat.reset_index(drop=True), ['follower_med'])
permuted_insta.head()
predictors = predictors.reset_index(drop=True)
all_predict(predictors, permuted_insta, 'follower_med') # Fits the same model as the line below, but prints the results
permute_reg = LinearRegression().fit(predictors, permuted_insta['follower_med'])

size coef: 0.03246015170389468
admission_rate coef: 86306.1534205706
income_med coef: -0.529866026635963
cost_attendance coef: -9.326440901972017
is_private coef: 383976.4298320461
locale_type_num coef: 784.9498812040543
locale_size_num coef: 17954.881876452517
sat_score coef: 411.07799979628174
far west coef: 4488.039241854555
great lakes coef: 61693.1279553721
mid east coef: 34424.37771114308
new england coef: 13107.79881147349
plains coef: 10543.26990275685
rocky mountains coef: 10284.356814977724
southeast coef: -28644.25022597506
southwest coef: -105896.72021160364
r^2: 0.270


In [38]:
# Are the coefficients from the permuted data larger than those fit on the actual data?
print(permute_reg.coef_ > best_reg.coef_)

[False  True  True False  True False  True  True  True  True  True  True
  True False False False]


Let's repeat this $n = 1000$ times to approximate the p-value.

In [39]:
n = 1000
m = len(predictors.columns)

In [40]:
permute_slopes = np.zeros((n, m))

for i in range(n):
    permuted_insta = permute_some_cols(filtered_instagram_details_sat.reset_index(drop=True), ['follower_med'])
    permute_reg = LinearRegression().fit(predictors, permuted_insta['follower_med'])
    permute_slopes[i] = permute_reg.coef_

In [41]:
# What proportion of simulated slopes were greater than the actual slope?
follower_med_pvalue = sum([coefs > best_reg.coef_ for coefs in permute_slopes]) / n
for i in range(m):
    print(predictors.columns[i], "p-value:", follower_med_pvalue[i])

size p-value: 0.01
admission_rate p-value: 0.902
income_med p-value: 0.901
cost_attendance p-value: 0.73
is_private p-value: 0.336
locale_type_num p-value: 0.253
locale_size_num p-value: 0.567
sat_score p-value: 0.253
far west p-value: 0.872
great lakes p-value: 0.202
mid east p-value: 0.155
new england p-value: 0.315
plains p-value: 0.569
rocky mountains p-value: 0.113
southeast p-value: 0.811
southwest p-value: 0.931


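All of the p-values except the one for size are above a significance level of 0.05. For size, only 1% of the permuted follower medians produced a slope greater than the one in the actual data, so we can reject the null hypothesis and conclude that university size is associated with the follower count of its Instagram account. Note that this test is one-sided: it only counts permuted slopes greater than the observed slope, so for a negative coefficient such as admission rate, a value near 1 (rather than near 0) is what would indicate an unusually extreme slope.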
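Since the comparison above only counts permuted slopes greater than the observed one, a two-sided variant that compares absolute values may be easier to read for negative coefficients. The following is a minimal sketch on synthetic data, not a rerun of the analysis: it uses `np.linalg.lstsq` in place of `LinearRegression`, and the function names, the seeds, and the `+ 1` correction that keeps p-values above zero are all choices made for this illustration.

```python
import numpy as np

def fit_slopes(X, y):
    # Ordinary least squares with an intercept; return only the slopes
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1:]

def permutation_pvalues(X, y, n_perm=500, seed=0):
    rng = np.random.default_rng(seed)
    observed = fit_slopes(X, y)
    exceed = np.zeros(X.shape[1])
    for _ in range(n_perm):
        # Shuffling y breaks any real link between predictors and target
        permuted = fit_slopes(X, rng.permutation(y))
        # Two-sided: count slopes at least as extreme in either direction
        exceed += np.abs(permuted) >= np.abs(observed)
    return (exceed + 1) / (n_perm + 1)

# Synthetic data: the first predictor matters, the second is pure noise
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] + rng.normal(size=100)

p_values = permutation_pvalues(X, y)
print(p_values)
```

With 16 predictors tested at once, as in this notebook, some correction for multiple comparisons would also be worth considering before reading too much into any single p-value.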

### Using Instagram stats to predict college properties

In [42]:
# Data frame of Instagram stats to use as predictors
insta_stats = instagram_details[['follower_med', 'following_med', 'posts_med']]

In [43]:
# Predict admission rate
all_predict(insta_stats, instagram_details, 'admission_rate')

follower_med coef: -1.7544128717508555e-07
following_med coef: 6.477789271442531e-05
posts_med coef: -2.3319095481641926e-05
r^2: 0.145


In [44]:
# Predict median income
all_predict(insta_stats, instagram_details, 'income_med')

follower_med coef: -0.029687631211307613
following_med coef: 3.318894224680521
posts_med coef: -3.2803662334266823
r^2: 0.160


In [45]:
# Predict student body size
all_predict(insta_stats, instagram_details, 'size')

follower_med coef: 0.00662543406230861
following_med coef: 0.745900509093363
posts_med coef: 1.683047148463907
r^2: 0.048


In [46]:
# Predict cost of attendance
all_predict(insta_stats, instagram_details, 'cost_attendance')

follower_med coef: 0.003017130445237828
following_med coef: -3.091473792381729
posts_med coef: -1.069569469848272
r^2: 0.049


None of these relationships is particularly strong: $r^2$ ranges from 0.048 to 0.160.