# INFO 2950 Phase4

## Introduction

As social media becomes ever more prevalent in our lives, companies and other institutions have needed to carve out an online presence. Universities are no exception to this, and the overwhelming majority of colleges have their own Instagram accounts in 2020. But how much do these Instagram accounts reflect the schools behind them? We sought to answer questions like this by analyzing the Instagram followers/following/post counts of 70 or so universities, comparing them to descriptive attributes such as student body size and acceptance rate.

In our analysis, we explored Instagram statistics that we collected ourselves over the past year along with college properties from the Department of Education's College Scorecard Project. We then combined the two datasets and ran regression models to see if we could predict Instagram statistics from college properties, and college properties from their Instagram statistics.

TODO briefly main findings

### Research Question
Do attributes of four-year colleges and universities in the U.S. such as size, admission rate, and household income affect the corresponding Instagram account's number of followers, following, and posts?

## Data Description

Our three main datasets are Instagram Data, Scorecard Data, and Merged Data, with merged data being a combination of the two aforementioned datasets.

### Instagram Data
Instagram data was collected per university. For every university, there is a table of observations with rows being timestamped days from the past year and columns being the number of followers, following, and posts on the university's Instagram page. The scraper behind this dataset was created by Changyuan as a personal project starting last year, and it has been collecting data automatically ever since. The 70 Instagram accounts tracked were chosen by hand, so they are not necessarily representative of all universities. During certain time intervals, data is missing due to rate limiting or unexpected modifications to the structure of the scraped website. No preprocessing was done before the data was received and cleaned for this project, and no data was collected on accounts belonging to individual people, only institutions. Raw data is available in the source code (below).

### Scorecard Data
Scorecard data contains universities and their attributes. The rows are universities, and the columns are admission rate, SAT score, cost of attendance, average income, median income, size of student population, latitude, longitude, city, name, ownership (private/public), region, state, locale type, and locale size. More attributes are present in the full dataset, which was assembled by the US Department of Education as part of the College Scorecard Project. The College Scorecard Project is designed to increase transparency and help students and families compare postsecondary institutions. The data originates from federal reporting from institutions, data on federal financial aid, and tax information, reported to the Integrated Postsecondary Education Data System (IPEDS). For many elements, data is processed and pooled across multiple years to reduce year to year variation in figures. Student-level data comes from the universities themselves or through recipients of federal student aid, who are likely aware that their anonymized statistics will be included as part of the institution's records. The raw College Scorecard data used in this project can be found at https://collegescorecard.ed.gov/data/.

### Merged Data
Merged data is a combination of the Instagram Data and Scorecard Data. It is a smaller dataset, as it only contains the subset of universities that we collected Instagram data for over the past year. All universities that were scraped are present in the Scorecard data. Since Instagram data exists by university, attribute, and time, we had to drop one of these dimensions to reasonably fit the data inside a 2D dataframe. As such, the data points for each university over the course of a year were compressed into their latest values and summary statistics. Merged data contains all columns in the Scorecard Data, plus columns for current and summary statistics of followers, following, and post count. The merged data only contains rows/observations for universities whose Scorecard entry we were able to map to an Instagram account that we collected data for. This final version of the cleaned data is in [`instagram_details.csv`](../data/instagram_details.csv).
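
The collapse-then-merge step described above can be sketched roughly as follows. This is a minimal illustration using pandas, not the project's actual cleaning code: the column names (`handle`, `followers`, `size`, `admission_rate`), the example values, and the particular set of summary statistics are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical long-format Instagram observations: one row per handle per day.
instagram = pd.DataFrame({
    "handle": ["school_a"] * 3 + ["school_b"] * 3,
    "date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"] * 2),
    "followers": [1000, 1100, 1500, 200, 210, 190],
})

# Hypothetical Scorecard extract, already mapped to Instagram handles.
scorecard = pd.DataFrame({
    "handle": ["school_a", "school_b", "school_c"],
    "size": [15000, 2000, 8000],
    "admission_rate": [0.10, 0.65, 0.40],
})

# Collapse the time axis: keep the first and latest values plus summary statistics.
summary = (
    instagram.sort_values("date")
    .groupby("handle")["followers"]
    .agg(
        followers_first="first",
        followers_latest="last",
        followers_median="median",
        followers_min="min",
        followers_max="max",
    )
    .reset_index()
)

# Percent increase over the observation window.
summary["followers_pct_increase"] = (
    (summary["followers_latest"] - summary["followers_first"])
    / summary["followers_first"]
    * 100
)

# An inner join keeps only universities present in both tables.
merged = scorecard.merge(summary, on="handle", how="inner")
print(merged)
```

In this sketch `school_c` drops out of `merged` because it has no Instagram observations, which mirrors how the merged dataset only keeps universities with a mapped account.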

## Preregistration Statement

These are our two preregistered analyses.

1. Perform linear regression on Instagram data vs college stats to see if there is a linear relationship. Specifically, follower count and follower percent increase vs. size and admission rate.

2. Perform linear regression on Instagram data vs categorical variables that we ignored before. Specifically, college ownership, region, and locale.
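
As a rough illustration of what these two analyses involve, the sketch below fits a one-predictor least-squares line, first with a continuous predictor (size) and then with a categorical predictor encoded as a 0/1 dummy (ownership). The `fit_line` helper and all values are hypothetical and synthetic; the project notebooks may use a different library and more predictors.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Synthetic values for illustration only (not taken from the project data).
size = [2000, 5000, 10000, 20000]
followers = [12000, 18000, 28000, 48000]  # constructed as 8000 + 2 * size
slope, intercept, r_squared = fit_line(size, followers)

# A categorical variable such as ownership can be encoded as a 0/1 dummy
# (assumed encoding: 1 = private, 0 = public).
is_private = [0, 0, 1, 1]
followers_by_ownership = [10000, 20000, 40000, 50000]
dummy_slope, dummy_intercept, _ = fit_line(is_private, followers_by_ownership)

print(slope, intercept, r_squared)
print(dummy_slope, dummy_intercept)
```

With a single 0/1 dummy, the intercept is the mean of the group coded 0 and the slope is the difference between the two group means, which is one way to read the coefficients of a regression on a categorical variable.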

## Data Analysis

### Exploration
Here are the most interesting findings from our data exploration!

1. When plotting follower counts, we noticed that there was a slight dip around the end of March / beginning of April (consistently across different schools). We think that this is because of acceptance results coming out, and that most of the unfollows are from seniors who didn't get accepted (or got accepted to better schools).
2. Johns Hopkins University had the most drastic follower percent increase, at over +100%. This may be due to JHU being a leader in publicizing COVID data. Their follower count shot up starting in March, which coincidentally lines up with a shift in their Instagram posts' aesthetic to stress-relieving pictures (starting after [this post](https://www.instagram.com/p/B-FlWBFhSrk/)).
3. In the merged dataset, we found a moderate correlation between follower median and college size. This led us to focus on those variables as predictors and outputs for our regression models.

More comprehensive analysis can be found in [phase2.ipynb](../p2/phase2.ipynb).

### Linear Regression
Here are the most interesting findings from our linear regressions!

1. 

More comprehensive analysis can be found in [phase4-linear.ipynb](./phase4-linear.ipynb).

### Logistic Regression
Here are the most interesting findings from our logistic regressions!

1. It seems like the follower data might be able to predict a school's admission rate and whether or not it's in a city, but not student family income or cost of attendance. The percent accuracy for the former outcome variables was consistently higher than 50%.
2. Removing Harvard, Stanford, and Yale did not really affect the model scores. This might be because Harvard, Stanford, and Yale are outliers with respect to popularity and Instagram data, but not institutional data. So our models might be able to predict outcomes for those schools with the same accuracy as the other institutions.
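
To make the 50% comparison concrete, the sketch below shows one way a continuous outcome could be turned into a binary label and how accuracy compares against a majority-class baseline. It assumes a median split, which may not be how the outcomes were binarized in the notebook; the values and the predictions are hypothetical.

```python
from collections import Counter
from statistics import median


def binarize_at_median(values):
    """Label each value 1 if it is above the median, else 0."""
    cutoff = median(values)
    return [1 if v > cutoff else 0 for v in values]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_true):
    """Accuracy of always predicting the most common class."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)


# Hypothetical admission rates and hypothetical model predictions.
admission_rate = [0.05, 0.08, 0.15, 0.30, 0.45, 0.60, 0.70, 0.85]
y_true = binarize_at_median(admission_rate)
y_pred = [0, 0, 1, 0, 1, 1, 0, 1]

acc = accuracy(y_true, y_pred)
baseline = majority_baseline(y_true)
print(f"accuracy={acc:.2f}, baseline={baseline:.2f}")
```

A median split gives balanced classes, so the baseline is 50%. If the classes were unbalanced (for example, most schools being in a city), the majority-class baseline would be higher than 50% and would be the fairer benchmark.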

More comprehensive analysis can be found in [phase4-logistic.ipynb](./phase4-logistic.ipynb).

## Evaluation of Significance

We found the results of our regression models to be mostly insignificant. The p-values we calculated were too large, so we failed to reject the null hypotheses. This may be because there is no relationship between Instagram statistics and institution data, or because our sample size was too small to detect one.

The exception was size as a predictor in our linear models. However, since the other variables in the multivariable linear regressions had high p-values, there might be interdependencies among the predictors that invalidate the significance of institution size. (Peer review feedback would be helpful here.)
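
One way to sanity-check the significance of a single predictor with a small sample is a permutation test, sketched below. This is a different technique from the regression p-values reported in our notebooks, and the helper names and data are hypothetical: it shuffles the outcome many times and asks how often a correlation at least as strong as the observed one appears by chance.

```python
import random
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the correlation between x and y."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Synthetic values for illustration only: size in thousands of students,
# followers in thousands, with a strong linear relationship built in.
size = [1, 2, 3, 4, 5, 6, 7, 8]
followers = [3, 5, 6, 9, 10, 13, 14, 17]

p_value = permutation_p_value(size, followers)
print(f"permutation p-value: {p_value:.4f}")
```

This only tests one predictor at a time, so it does not address interdependencies between predictors in a multivariable model.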

More comprehensive significance evaluation can be found at the ends of [phase4-linear.ipynb](./phase4-linear.ipynb) and [phase4-logistic.ipynb](./phase4-logistic.ipynb).

## Interpretation and Conclusions

TODO
What did you find over the course of your data analysis, and how confident are you in these conclusions? Detail your results more so than in the introduction, now that the reader is familiar with your methods and analysis. Interpret these results in the wider context of the real-life application from where your data hails.  
r^2 close to zero for percent increases  
if we were to repeat this experiment...

## Limitations

### Instagram Data
1. We have chunks of missing or corrupt data. We think this is due to Instagram's rate limiting, and mishandling of rate-limiting errors when collecting this data, or changes to Instagram's website that persisted for a period of time and were then removed. This can be observed in the data exploration graphs. We dealt with this (particularly range and rate of change) by recently scraping more data and ensuring that the beginning and ending dates had fully populated entries.
2. The Instagram handles were handpicked by Changyuan, so the list is not comprehensive. It is not necessarily representative of all types of colleges/universities in the U.S. For example, Colorado schools (where Changyuan is from) tend to be overrepresented.
3. Instagram statistics are not as granular as we would have liked: large counts are abbreviated (e.g. `k` to denote thousand and `m` to denote million). As a result, there is some stepping that can be observed in the graphs (especially for Harvard and other Instagram accounts with high follower counts). If we were to do this again, we would either scrape from a third-party site (like https://instastatistics.com/) or scrape HTML tag metadata instead of the text.
4. Our dataset is 3-dimensional: Instagram handle, statistics, time. As a result, there is some information loss when collapsing it to 2D. In this phase, we chose to collapse statistics and time into one variable: summary statistics over time. This can be observed in the exploratory data analysis for our merged dataset. An alternative approach, as suggested by our mentor William, would be to use [other dimensionality reduction techniques](https://www.analyticsvidhya.com/blog/2018/08/dimensionality-reduction-techniques-python/). We decided against this since it was outside the scope of the class and would have taken more time than we had to understand everything from scratch.
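
The granularity limitation can be illustrated with a small parser for abbreviated counts. This is a sketch under assumptions: the `parse_count` helper is hypothetical, and the exact strings that Instagram displays (and how the scraper handled them) may differ.

```python
def parse_count(text):
    """Convert a displayed count such as '1,234', '12.3k', or '1.5m' to an integer."""
    cleaned = text.strip().lower().replace(",", "")
    multipliers = {"k": 1_000, "m": 1_000_000}
    if cleaned and cleaned[-1] in multipliers:
        # The abbreviated form keeps only a few significant digits,
        # so the exact count cannot be recovered from it.
        return round(float(cleaned[:-1]) * multipliers[cleaned[-1]])
    return int(cleaned)


for displayed in ["987", "1,234", "12.3k", "1.5m"]:
    print(displayed, "->", parse_count(displayed))
```

Every true count that is displayed as `1.5m` parses to the same value, which is why the graphs for large accounts move in visible steps rather than smoothly.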

### Scorecard Data
1. This dataset is massive, so it had to be heavily filtered. Many columns that we deemed irrelevant were dropped. We also filtered the dataset down to four-year colleges. This limitation caused us to change our research question to be more specific.
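
A filtering step along these lines might look like the sketch below. It is not the project's actual code: the column names (`INSTNM`, `ICLEVEL`, `ADM_RATE`, `UGDS`, `CONTROL`) and the convention that `ICLEVEL == 1` marks four-year institutions are assumptions that should be checked against the College Scorecard data dictionary, and the rows are made up for the example.

```python
import io

import pandas as pd

# Tiny stand-in for the raw Scorecard CSV (hypothetical rows).
raw_csv = io.StringIO(
    "INSTNM,ICLEVEL,ADM_RATE,UGDS,CONTROL,EXTRA_COLUMN\n"
    "Example University,1,0.25,12000,1,foo\n"
    "Example Community College,2,NULL,5000,1,bar\n"
    "Example Institute,1,PrivacySuppressed,3000,2,baz\n"
)

# Read only the columns of interest and treat placeholder strings as missing.
keep = ["INSTNM", "ICLEVEL", "ADM_RATE", "UGDS", "CONTROL"]
scorecard = pd.read_csv(
    raw_csv,
    usecols=keep,
    na_values=["NULL", "PrivacySuppressed"],
)

# Keep four-year institutions only (assumed encoding: ICLEVEL == 1).
four_year = (
    scorecard[scorecard["ICLEVEL"] == 1]
    .drop(columns="ICLEVEL")
    .reset_index(drop=True)
)
print(four_year)
```

Reading with `usecols` avoids loading the full set of columns into memory, which helps when the raw file is very wide.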

### Data Analysis and Evaluation of Significance
1. What we decided to show in the final project, as well as what we decided to explore (e.g. which predictor/outcome variables), are all subject to our biases as data scientists. Our primary goal was to find something interesting (significant or not), which is entirely subjective and could have led to some personal bias in our analysis.
2. Our sample size is extremely small (~70 institutions) compared to what our project is attempting to generalize to (four-year colleges and universities in the U.S.). So, all our findings, including insignificance, should be taken with a grain of salt.

## Source Code

https://github.coecis.cornell.edu/bfs45/info2950-project

## Acknowledgments

### Contributors
Benjamin Shen (bfs45)  
Changyuan Lin (cl859)  
Larina Fu (lrf59) - Phases 0,1  
William Jacob Bekerman (wjb239) - TA Mentor  

Additional thanks to all our peers who gave feedback on our project!

## Questions for Reviewers

1. Did we do enough for our final phase analysis? Can you think of any other ways to include the other stuff we learned in class (like clustering)?
2. How do we calculate significance for multi-variable predictors, for both linear and logistic regression (this is inquiring about https://campuswire.com/c/GFD0330A5/feed/668)? Did we do it right? Did we interpret it right? What does a low p-value for only one variable (but not the others) mean?