# INFO2950 Phase 4: Linear Regression

1. Perform linear regression on Instagram data vs. college stats to see whether there is a linear relationship: specifically, follower count and follower percent increase vs. size and admission rate.
2. Perform linear regression on Instagram data vs. the categorical variables that we ignored before: college ownership, region, and locale.

In [1]:
import os
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

In [2]:
data_dir = "../data"

## Merged Dataset

In [3]:
# Load merged dataset
instagram_details = pd.read_csv(os.path.join(data_dir, "instagram_details.csv"))
instagram_details.head()

,name,instagram,follower_curr,follower_mean,follower_med,follower_std,follower_min,follower_max,following_curr,following_mean,...,income_med,size,lat,lon,city,ownership,region,state,locale_type,locale_size
0,University of Chicago,uchicago,116800.0,102192.602041,100650.0,9675.450589,85100.0,116800.0,284.0,271.576531,...,47139.0,6600.0,41.787994,-87.599539,Chicago,private non-profit,great lakes,IL,city,large
1,Yale University,yale,490700.0,445869.550173,460900.0,38742.023912,369300.0,490700.0,249.0,244.269896,...,44004.0,5963.0,41.311158,-72.926688,New Haven,private non-profit,new england,CT,city,medium
2,Brown University,brownu,193000.0,185870.992366,187900.0,5583.446131,169000.0,193000.0,141.0,120.185751,...,82670.0,6752.0,41.82617,-71.40385,Providence,private non-profit,new england,RI,city,medium
3,Dartmouth College,dartmouthcollege,65400.0,58784.468193,58100.0,3262.492004,53200.0,65400.0,2278.0,2159.715013,...,68455.0,4312.0,43.704115,-72.289949,Hanover,private non-profit,new england,NH,town,small
4,Haverford College,haverfordedu,9063.0,7931.378109,8007.0,692.013379,6897.0,9063.0,207.0,169.621891,...,65396.0,1305.0,40.007452,-75.305207,Haverford,private non-profit,mid east,PA,suburb,large


In [4]:
# What are the columns again
instagram_details.columns

Index(['name', 'instagram', 'follower_curr', 'follower_mean', 'follower_med',
       'follower_std', 'follower_min', 'follower_max', 'following_curr',
       'following_mean', 'following_med', 'following_std', 'following_min',
       'following_max', 'posts_curr', 'posts_mean', 'posts_med', 'posts_std',
       'posts_min', 'posts_max', 'admission_rate', 'sat_score',
       'cost_attendance', 'income_avg', 'income_med', 'size', 'lat', 'lon',
       'city', 'ownership', 'region', 'state', 'locale_type', 'locale_size'],
      dtype='object')

In [5]:
# Drop @ivyleague for these analyses (copy so that columns added later go on an independent frame)
instagram_details = instagram_details[instagram_details.instagram != "ivyleague"].copy()

## Multivariate regression on numerical data

In [6]:
def reg_stats(df, predictors, target):
    # Fit ordinary least squares, then print each coefficient and the in-sample r^2
    reg = LinearRegression().fit(df[predictors], df[target])
    for name, coef in zip(predictors, reg.coef_):
        print("{} coef: {:.6f}".format(name, coef))
    print("r^2: {:.3f}".format(reg.score(df[predictors], df[target])))

In [7]:
reg_stats(instagram_details, ['admission_rate'], 'follower_med')

admission_rate coef: -182343.380366
r^2: 0.042


Predicting median followers from the college's admission rate alone gives little insight: $r^2$ is only 0.042.
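With a single predictor, the $r^2$ of a least-squares line equals the squared Pearson correlation between predictor and target, so an $r^2$ of 0.042 with a negative coefficient corresponds to a correlation of roughly -0.2. The sketch below checks that identity in plain Python on made-up numbers (not our dataset); the helper names `r_squared` and `pearson` are hypothetical.

```python
from math import sqrt

def r_squared(x, y):
    """r^2 = 1 - SS_res / SS_tot for the least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_res / ss_tot

def pearson(x, y):
    """Pearson correlation coefficient between x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Synthetic admission rates and follower counts, for illustration only
x = [0.05, 0.10, 0.20, 0.35, 0.50, 0.65]
y = [120.0, 300.0, 90.0, 150.0, 60.0, 110.0]

print("r^2 from the fitted line:", round(r_squared(x, y), 6))
print("squared correlation:     ", round(pearson(x, y) ** 2, 6))
```

The two printed values agree, which is why a weak correlation necessarily shows up as a low single-predictor $r^2$.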

In [8]:
# Side check (not closely related to the other cells): how well does median income predict average income?
reg_stats(instagram_details, ['income_med'], 'income_avg')

income_med coef: 0.899166
r^2: 0.841


In [9]:
reg_stats(instagram_details, ['size'], 'follower_med')

size coef: 2.551152
r^2: 0.018


Predicting median followers from size alone is even less useful ($r^2 = 0.018$).

In [10]:
reg_stats(instagram_details, ['size', 'admission_rate'], 'follower_med')

size coef: 5.399882
admission_rate coef: -297206.463067
r^2: 0.107


Using size and admission rate together as predictors, however, raises our $r^2$ to 0.107.
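The two coefficients are on very different scales because the predictors are: size is counted in students, while admission rate is assumed here to be stored as a fraction between 0 and 1 (the truncated table does not confirm this). One rough way to compare them is to multiply each coefficient by a plausible change in its predictor. The step sizes below (1,000 students, 10 percentage points) are arbitrary choices for illustration.

```python
# Coefficients copied from the size + admission_rate regression output
size_coef = 5.399882
admission_rate_coef = -297206.463067

# Hypothetical step sizes, chosen only for illustration
extra_students = 1000
extra_admission_rate = 0.10  # assumes admission_rate is a fraction in [0, 1]

# Predicted change in median followers, holding the other predictor fixed
size_effect = size_coef * extra_students
admission_effect = admission_rate_coef * extra_admission_rate

print(f"+1,000 students: {size_effect:+,.0f} median followers")
print(f"+10 points admission rate: {admission_effect:+,.0f} median followers")
```

Under these assumptions, the model associates 1,000 more students with about 5,400 more median followers and a 10-point higher admission rate with about 29,700 fewer. With $r^2$ at 0.107, both are weak associations rather than reliable predictions.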

In [11]:
reg_stats(instagram_details, ['size', 'admission_rate', 'income_med'], 'follower_med')

size coef: 3.607143
admission_rate coef: -265444.552864
income_med coef: -3.137739
r^2: 0.193


With size, admission rate, and median income as predictors, we get $r^2 = 0.193$.

In [12]:
reg_stats(instagram_details, ['size', 'admission_rate', 'income_med', 'cost_attendance'], 'follower_med')

size coef: 4.108562
admission_rate coef: -232603.178019
income_med coef: -3.195766
cost_attendance coef: 0.679158
r^2: 0.193


Adding cost of attendance leaves $r^2$ unchanged at 0.193, so it contributes essentially nothing beyond the other three predictors.
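In-sample $r^2$ can never go down when a predictor is added to a least-squares fit, so an unchanged $r^2$ is the worst case for the new variable. Adjusted $r^2$ makes that cost visible by penalizing the number of predictors. A small sketch, assuming 69 colleges (the row count that `describe()` reports later in this notebook); `adjusted_r2` is a hypothetical helper, not part of scikit-learn.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted r^2 = 1 - (1 - r^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

n = 69  # assumption: number of colleges in instagram_details

three_predictors = adjusted_r2(0.193, n, 3)  # size, admission_rate, income_med
four_predictors = adjusted_r2(0.193, n, 4)   # ... plus cost_attendance

print(f"3 predictors: {three_predictors:.3f}")
print(f"4 predictors: {four_predictors:.3f}")
```

With the same raw $r^2$, the four-predictor model scores lower once the extra parameter is accounted for, which supports leaving cost of attendance out.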

In [13]:
def remove_outliers(df, outliers = {"Harvard University", "Stanford University", "Yale University"}):
    return df[~df.name.isin(outliers)]

What if we remove Harvard, Stanford, and Yale as in our earlier exploratory analysis?

In [14]:
filtered_instagram_details = remove_outliers(instagram_details)

reg_stats(filtered_instagram_details, ['size', 'admission_rate', 'income_med', 'cost_attendance'], 'follower_med')

size coef: 4.697836
admission_rate coef: -163826.635340
income_med coef: -0.615451
cost_attendance coef: -0.165052
r^2: 0.469


With the three outliers removed, the same four predictors (size, admission rate, median income, and cost of attendance) yield a much higher $r^2 = 0.469$!
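Harvard, Stanford, and Yale were picked by hand from the exploratory analysis. A rule-based alternative, which this notebook does not use, is Tukey's 1.5 × IQR fence. The sketch below applies it to made-up follower counts; `iqr_outliers` is a hypothetical helper, and the result on our real data may well differ from the hand-picked list.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Made-up median follower counts, NOT taken from instagram_details
followers = [8000, 20000, 35000, 50000, 58000, 75000,
             100000, 120000, 190000, 460000, 1500000]

outliers = iqr_outliers(followers)
print(outliers)
```

On this toy list only the two largest accounts fall outside the fence. A rule like this makes the exclusion criterion explicit, though the choice of `k` is still a judgment call.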

What if we try to predict follower percent increase from the same predictors?

In [15]:
instagram_details['follower_pct_increase'] = (instagram_details['follower_max'] - instagram_details['follower_min']) / instagram_details['follower_min'] * 100
filtered_instagram_details = remove_outliers(instagram_details)
instagram_details['follower_pct_increase'].describe()

count     69.000000
mean      28.956461
std       14.540610
min       14.201183
25%       21.788674
50%       26.530612
75%       32.625995
max      116.908213
Name: follower_pct_increase, dtype: float64

Over the past year, the average college Instagram account grew 28.96% from its minimum to its maximum follower count. Among the 69 colleges we have data for, the increase ranged from 14.20% to 116.91%.
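As a sanity check on the formula, here is the same calculation done by hand for two schools, using the `follower_min` and `follower_max` values shown in the first table of this notebook. The function name `pct_increase` is just for illustration.

```python
def pct_increase(follower_min, follower_max):
    """Percent growth from the minimum to the maximum follower count."""
    return (follower_max - follower_min) / follower_min * 100

# follower_min and follower_max copied from the head() output above
brown = pct_increase(169000.0, 193000.0)
uchicago = pct_increase(85100.0, 116800.0)

print(f"Brown: {brown:.2f}%")
print(f"UChicago: {uchicago:.2f}%")
```

Brown comes out at about 14.20%, which matches the minimum in the summary above, and UChicago at about 37.25%. Note that this measures the spread between the lowest and highest counts in the period, not growth from the first observation to the last.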

In [16]:
reg_stats(instagram_details, ['size', 'admission_rate', 'income_med', 'cost_attendance'], 'follower_pct_increase')

size coef: -0.000147
admission_rate coef: 10.732157
income_med coef: -0.000087
cost_attendance coef: 0.000198
r^2: 0.060


In [17]:
reg_stats(filtered_instagram_details, ['size', 'admission_rate', 'income_med', 'cost_attendance'], 'follower_pct_increase')

size coef: -0.000148
admission_rate coef: 10.554681
income_med coef: -0.000091
cost_attendance coef: 0.000198
r^2: 0.059


We are much less confident in predicting follower percent increase with the same predictors we used to predict median followers: $r^2$ is about 0.06 with or without the outliers.

In [18]:
reg_stats(filtered_instagram_details, ['size', 'admission_rate', 'income_med', 'follower_pct_increase', 'cost_attendance'], 'follower_med')

size coef: 4.699903
admission_rate coef: -163973.871382
income_med coef: -0.614179
follower_pct_increase coef: 13.949833
cost_attendance coef: -0.167821
r^2: 0.469


Adding follower percent increase to our predictors doesn't improve the model's performance on predicting median followers ($r^2$ stays at 0.469).

## Multivariate regression, adding in categorical data

To do linear regression on college ownership, region, and locale, we'll need to make indicator variables. For ownership this is simple enough (public/private), but for region and locale it's a bit more complicated.

In [19]:
instagram_details['is_private'] = pd.get_dummies(instagram_details['ownership'])['private non-profit']

# Map locale type to a number, in increasing order of "cityness"
locale_type_map = {"rural": 1, "town": 2, "suburb": 3, "city": 4}
instagram_details['locale_type_num'] = instagram_details['locale_type'].map(locale_type_map)

instagram_details.head()

,name,instagram,follower_curr,follower_mean,follower_med,follower_std,follower_min,follower_max,following_curr,following_mean,...,lon,city,ownership,region,state,locale_type,locale_size,follower_pct_increase,is_private,locale_type_num
0,University of Chicago,uchicago,116800.0,102192.602041,100650.0,9675.450589,85100.0,116800.0,284.0,271.576531,...,-87.599539,Chicago,private non-profit,great lakes,IL,city,large,37.250294,1,4
1,Yale University,yale,490700.0,445869.550173,460900.0,38742.023912,369300.0,490700.0,249.0,244.269896,...,-72.926688,New Haven,private non-profit,new england,CT,city,medium,32.873003,1,4
2,Brown University,brownu,193000.0,185870.992366,187900.0,5583.446131,169000.0,193000.0,141.0,120.185751,...,-71.40385,Providence,private non-profit,new england,RI,city,medium,14.201183,1,4
3,Dartmouth College,dartmouthcollege,65400.0,58784.468193,58100.0,3262.492004,53200.0,65400.0,2278.0,2159.715013,...,-72.289949,Hanover,private non-profit,new england,NH,town,small,22.932331,1,2
4,Haverford College,haverfordedu,9063.0,7931.378109,8007.0,692.013379,6897.0,9063.0,207.0,169.621891,...,-75.305207,Haverford,private non-profit,mid east,PA,suburb,large,31.404959,1,3


In [20]:
# There isn't a natural order for region, so we will predict using each region as a dummy variable
region_dummies = pd.get_dummies(instagram_details['region'])

region_dummies.head()

,far west,great lakes,mid east,new england,plains,rocky mountains,southeast,southwest
0,0,1,0,0,0,0,0,0
1,0,0,0,1,0,0,0,0
2,0,0,0,1,0,0,0,0
3,0,0,0,1,0,0,0,0
4,0,0,1,0,0,0,0,0
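One caveat with a full set of dummies: every row has exactly one region, so the region columns always sum to 1 and are perfectly collinear with the regression intercept. The fit and its $r^2$ are unaffected, but the individual region coefficients are not uniquely determined. A common remedy is to drop one category as a baseline, which `pd.get_dummies` supports through `drop_first=True`. The plain-Python sketch below mimics that behavior on the five regions shown above; `one_hot` is a hypothetical helper written only for this illustration.

```python
def one_hot(values, drop_first=False):
    """Return (category_names, rows) of 0/1 indicators, one column per category."""
    categories = sorted(set(values))
    if drop_first:
        categories = categories[1:]  # the first category becomes the baseline
    rows = [[int(v == c) for c in categories] for v in values]
    return categories, rows

# Regions of the first five schools in the table above
regions = ["great lakes", "new england", "new england", "new england", "mid east"]

cols_full, full = one_hot(regions)
cols_drop, dropped = one_hot(regions, drop_first=True)

print(cols_full, [sum(row) for row in full])  # every row sums to 1
print(cols_drop, dropped)                     # baseline rows are all zeros
```

With the baseline dropped, each remaining coefficient reads as the difference from the baseline region, which is usually easier to interpret.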


In [21]:
reg_stats(instagram_details, ['is_private'], 'follower_med')

is_private coef: 13735.530000
r^2: 0.001


In [22]:
# With Harvard, etc. removed
filtered_instagram_details = remove_outliers(instagram_details)
reg_stats(filtered_instagram_details, ['is_private'], 'follower_med')

is_private coef: -36077.989362
r^2: 0.043


Using is_private alone as a predictor is (surprisingly?) not very helpful: $r^2$ is 0.001 with all schools and 0.043 with the outliers removed.

In [23]:
reg_stats(filtered_instagram_details, ['is_private', 'locale_type_num'], 'follower_med')

is_private coef: -18401.738280
locale_type_num coef: 41321.707371
r^2: 0.176


Adding locale type improves the fit to $r^2 = 0.176$.

In [24]:
reg_stats(filtered_instagram_details, ['size', 'admission_rate', 'income_med', 'cost_attendance', 'is_private', 'locale_type_num'], 'follower_med')

size coef: 3.907407
admission_rate coef: -162974.005970
income_med coef: -0.537504
cost_attendance coef: 0.775280
is_private coef: -45842.501669
locale_type_num coef: 20834.384560
r^2: 0.497


Predicting with everything discussed so far except region explains almost 50% of the variability in the median number of followers ($r^2 = 0.497$).

In [25]:
reg = LinearRegression().fit(region_dummies, instagram_details['follower_med'])
print("r^2: {}".format(reg.score(region_dummies, instagram_details['follower_med'])))

r^2: 0.05923232019422253


Using only the region dummy variables as predictors gives $r^2 \approx 0.059$, so region on its own has a small impact.

In [26]:
def all_predict(predictors, df, target):
    # Same as reg_stats, but takes a ready-made predictor DataFrame instead of column names
    reg = LinearRegression().fit(predictors, df[target])
    for name, coef in zip(predictors.columns, reg.coef_):
        print("{} coef: {}".format(name, coef))
    print("r^2: {:.3f}".format(reg.score(predictors, df[target])))

Now let's use size, admission rate, median income, cost of attendance, is_private, and locale type, joined with the region dummies, as predictors.

In [27]:
# All schools, predicting follower median
# Join the subset of predictors from instagram_details with the region dummies
predictors = instagram_details[['size', 'admission_rate', 'income_med', 'cost_attendance', 'is_private', 'locale_type_num']].join(region_dummies)

all_predict(predictors, instagram_details, 'follower_med')

size coef: 2.726625219507167
admission_rate coef: -294733.0245089307
income_med coef: -3.2039287212138423
cost_attendance coef: 0.8896388952455304
is_private coef: -67851.17147057487
locale_type_num coef: 57456.32618276485
far west coef: -31119.062714149964
great lakes coef: 35608.39759117596
mid east coef: -15330.84275672166
new england coef: 99970.23556244952
plains coef: 41152.20422867314
rocky mountains coef: 17815.820186250872
southeast coef: -60137.09808458026
southwest coef: -87959.65401309924
r^2: 0.272


In [28]:
# Schools excluding Harvard, Yale, Stanford
predictors = filtered_instagram_details[['size', 'admission_rate', 'income_med', 'cost_attendance', 'is_private', 'locale_type_num']].join(region_dummies)

all_predict(predictors, filtered_instagram_details, 'follower_med')

size coef: 4.384411121206706
admission_rate coef: -215319.24380301658
income_med coef: -0.6488438462996566
cost_attendance coef: -0.8806930856069419
is_private coef: 1634.9361428844186
locale_type_num coef: 20310.775622115532
far west coef: -28529.743934604252
great lakes coef: 24243.619365076338
mid east coef: 36206.94831200551
new england coef: 17778.071175393696
plains coef: 3277.183254988371
rocky mountains coef: 22937.631202837154
southeast coef: -17264.149421248003
southwest coef: -58649.559954447126
r^2: 0.593


Wow! With Harvard, Yale, and Stanford excluded, regression on school size, admission rate, median income, cost of attendance, public/private, locale type, and region reaches an $r^2$ of 0.593.

In [29]:
# Predicting follower percent increase?

all_predict(predictors, filtered_instagram_details, 'follower_pct_increase')

size coef: -0.0005508489760132854
admission_rate coef: 14.224782289933392
income_med coef: -3.9247373489971696e-05
cost_attendance coef: 0.00020580036848850836
is_private coef: -9.579002442086633
locale_type_num coef: 7.360503513834872
far west coef: 1.6124912203901935
great lakes coef: -1.865017027562963
mid east coef: 7.436009469725475
new england coef: 1.1007370807533756
plains coef: 5.851272906846136
rocky mountains coef: -8.83347237828292
southeast coef: -4.0820362261841
southwest coef: -1.2199850456852652
r^2: 0.168


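Not good: the same predictors explain only about 17% of the variability in follower percent increase ($r^2 = 0.168$).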

In [30]:
# Predicting median number of accounts following

all_predict(predictors, filtered_instagram_details, 'following_med')

size coef: -0.01906676307480299
admission_rate coef: 4300.446519221225
income_med coef: 0.005142221511879059
cost_attendance coef: 0.11746635395588942
is_private coef: -4635.170653377607
locale_type_num coef: -197.20086875686437
far west coef: -264.7547203560116
great lakes coef: 341.41354313354105
mid east coef: -65.81136455815644
new england coef: 248.79871092039443
plains coef: -522.3802554334013
rocky mountains coef: -1279.6009693712324
southeast coef: 1138.0574081071882
southwest coef: 404.277647557659
r^2: 0.355


In [31]:
# Predicting current number of posts

all_predict(predictors, filtered_instagram_details, 'posts_curr')

size coef: 0.0018818287188277101
admission_rate coef: 268.7983383932008
income_med coef: -0.0064480506852818385
cost_attendance coef: 0.05811868990939299
is_private coef: -2636.41610524586
locale_type_num coef: 118.94454289418705
far west coef: -746.4641387634899
great lakes coef: 251.79187523460658
mid east coef: 310.5807855074791
new england coef: 265.1213440839566
plains coef: -114.72518020236691
rocky mountains coef: -313.31715752110557
southeast coef: 714.0385302258705
southwest coef: -367.0260585649434
r^2: 0.312


### Using Instagram stats to predict college properties

In [32]:
# Data frame of Instagram stats to use as predictors
insta_stats = instagram_details[['follower_med', 'following_med', 'posts_med']]

In [33]:
all_predict(insta_stats, instagram_details, 'admission_rate')

follower_med coef: -1.7544128717508555e-07
following_med coef: 6.477789271442531e-05
posts_med coef: -2.3319095481641926e-05
r^2: 0.145


In [34]:
all_predict(insta_stats, instagram_details, 'income_med')

follower_med coef: -0.029687631211307613
following_med coef: 3.318894224680521
posts_med coef: -3.2803662334266823
r^2: 0.160


In [35]:
all_predict(insta_stats, instagram_details, 'size')

follower_med coef: 0.00662543406230861
following_med coef: 0.745900509093363
posts_med coef: 1.683047148463907
r^2: 0.048


In [36]:
all_predict(insta_stats, instagram_details, 'cost_attendance')

follower_med coef: 0.003017130445237828
following_med coef: -3.091473792381729
posts_med coef: -1.069569469848272
r^2: 0.049


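Not much: Instagram stats explain little of the variability in any of these college properties, with $r^2$ ranging from 0.048 (size) to 0.160 (median income).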