# INFO 2950 Phase 2

## Research Question
What factors affect the Instagram trends and statistics of four-year colleges and universities in the U.S.?

## Data Collection and Cleaning
Our dataset consists of three sections: Instagram Data, College Scorecard Data, and Merged Data.

More detailed information and code on our data cleaning can be found in our [data cleaning notebook](phase2-data.ipynb).  

### Instagram Data

We obtained Instagram data (follower count, following count, and post count) by scraping the Instagram pages of various universities for more than a year. A cron job ran a Python script once per day at noon, fetched the current counts, and appended them to the output file for that university. We then cleaned the data for analysis by adding column headers, converting the timestamp column to a datetime type, and converting the abbreviated counts to plain integers. Detailed code is available in the data cleaning notebook linked above.

E.g., the first 3 lines of `cornelluniversity.out`:
```
2019-09-08 15:42:32.026749,190.5k,160,1773
2019-09-09 12:00:03.187746,190.7k,160,1774
2019-09-10 12:00:02.818769,190.9k,160,1775
```

1. Read the `.out` files and convert them to dataframes
2. Add a column header: `columns = ["date", "followers", "following", "posts"]`
3. Convert the first column to datetime
4. Apply a function that expands the `k` and `m` suffixes (e.g. `190.5k` becomes `190500`) so the count columns are numeric
5. Write the cleaned dataframes to `.csv` files

E.g., the first 4 lines of `cornelluniversity.csv`:
```
date,followers,following,posts
2019-09-08 15:42:32.026749,190500,160,1773
2019-09-09 12:00:03.187746,190700,160,1774
2019-09-10 12:00:02.818769,190900,160,1775
```
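The cleaning steps above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the helper names `parse_count` and `clean_out_file` are hypothetical, and the sketch assumes every row has all four fields (it does not handle the missing or corrupt entries discussed under Data Limitations).

```python
import io

import pandas as pd

COLUMNS = ["date", "followers", "following", "posts"]
SUFFIXES = {"k": 1_000, "m": 1_000_000}


def parse_count(value):
    """Convert an abbreviated count such as '190.5k' or '1.2m' to an int."""
    text = str(value).strip().lower()
    if text and text[-1] in SUFFIXES:
        # round() guards against floating-point error, e.g. 190.7 * 1000
        return int(round(float(text[:-1]) * SUFFIXES[text[-1]]))
    return int(float(text))


def clean_out_file(source):
    """Read a headerless .out file (path or file-like) into a typed dataframe."""
    df = pd.read_csv(source, header=None, names=COLUMNS, dtype=str)
    df["date"] = pd.to_datetime(df["date"])
    for col in COLUMNS[1:]:
        df[col] = df[col].map(parse_count)
    return df


# The first two lines of cornelluniversity.out, as shown earlier
sample = io.StringIO(
    "2019-09-08 15:42:32.026749,190.5k,160,1773\n"
    "2019-09-09 12:00:03.187746,190.7k,160,1774\n"
)
cleaned = clean_out_file(sample)
print(cleaned)
```

Writing the result with `cleaned.to_csv(path, index=False)` would produce a file in the same shape as the `.csv` excerpt above.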

### Scorecard Data

Attributes about individual universities were collected from the US Department of Education's [College Scorecard](https://collegescorecard.ed.gov/data/). Due to the massive size of the raw dataset, we fetched a filtered version of this data through the API. We requested it in JSON format because the CSV endpoint returned corrupt data. We then cleaned the data by renaming columns, mapping categorical integers to strings, splitting locale into two columns, and dropping columns. As before, detailed code is available in the data cleaning notebook linked above.

E.g., the first 23 lines of `scorecard_raw.json`:
```
[
  {
    "id": 147244,
    "latest.admissions.admission_rate.overall": 0.6126,
    "latest.admissions.sat_scores.average.overall": 1125.0,
    "latest.cost.attendance.academic_year": 46026,
    "latest.cost.avg_net_price.private": 20560,
    "latest.cost.avg_net_price.public": null,
    "latest.student.demographics.avg_family_income": 66334,
    "latest.student.demographics.median_family_income": 49741,
    "latest.student.size": 1918,
    "location.lat": 39.842612,
    "location.lon": -88.976298,
    "school.carnegie_size_setting": 11,
    "school.carnegie_undergrad": 13,
    "school.city": "Decatur",
    "school.locale": 13,
    "school.name": "Millikin University",
    "school.ownership": 2,
    "school.region_id": 3,
    "school.state": "IL",
    "school.zip": "62522-2084"
  },
```

1. Load `.json` file and convert it to dataframe
2. Rename columns
3. Rename entries from integers to human-readable values for ownership, region, and locale
4. Drop more columns for readability (we may choose to include them later)
5. Write cleaned dataframe to `.csv` file

E.g., the first 2 lines of `scorecard.csv`:
```
admission_rate,sat_score,cost_attendance,income_avg,income_med,size,lat,lon,city,name,ownership,region,state,locale_type,locale_size
0.6126,1125.0,46026.0,66334.0,49741.0,1918.0,39.842612,-88.976298,Decatur,Millikin University,private non-profit,great lakes,IL,city,small
```
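A rough sketch of the renaming, mapping, and locale-splitting steps is shown below, using a few fields of the Millikin University record from `scorecard_raw.json`. The dictionaries and the `split_locale` helper are assumptions made for illustration. Only the mappings visible in the excerpts (ownership `2` to `private non-profit`, locale `13` to `city`/`small`, locale `21` to `suburb`/`large`) are confirmed by this document; the remaining labels follow our understanding of the federal locale and ownership codes and may not match the notebook's wording.

```python
import pandas as pd

# Assumed subset of the column renames
RENAME = {
    "latest.admissions.admission_rate.overall": "admission_rate",
    "latest.admissions.sat_scores.average.overall": "sat_score",
    "latest.student.size": "size",
    "school.name": "name",
    "school.ownership": "ownership",
    "school.locale": "locale",
}
# Assumed code tables; see the caveat in the paragraph above
OWNERSHIP = {1: "public", 2: "private non-profit", 3: "private for-profit"}
LOCALE_TYPE = {1: "city", 2: "suburb", 3: "town", 4: "rural"}
URBAN_SIZE = {1: "large", 2: "midsize", 3: "small"}
RURAL_DISTANCE = {1: "fringe", 2: "distant", 3: "remote"}


def split_locale(code):
    """Split a two-digit locale code into (type, size) labels."""
    if pd.isna(code):
        return (None, None)
    tens, ones = divmod(int(code), 10)
    labels = URBAN_SIZE if tens in (1, 2) else RURAL_DISTANCE
    return (LOCALE_TYPE.get(tens), labels.get(ones))


record = {
    "latest.admissions.admission_rate.overall": 0.6126,
    "latest.admissions.sat_scores.average.overall": 1125.0,
    "latest.student.size": 1918,
    "school.locale": 13,
    "school.name": "Millikin University",
    "school.ownership": 2,
}
scorecard = pd.DataFrame([record]).rename(columns=RENAME)
scorecard["ownership"] = scorecard["ownership"].map(OWNERSHIP)

# Split locale into two columns, then drop the original integer column
kinds, sizes = zip(*(split_locale(code) for code in scorecard["locale"]))
scorecard["locale_type"] = list(kinds)
scorecard["locale_size"] = list(sizes)
scorecard = scorecard.drop(columns=["locale"])
print(scorecard)
```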

### Merged Data

We merged the data with a left join between a new Instagram summary statistics dataframe and the scorecard dataframe.

1. Generate summary statistics for the Instagram dataset and create a dataframe from them
2. Using a mapping from Instagram handles to college/university names, add a `name` column to the Instagram summary dataframe
3. Merge the dataframes with a left join on the `name` column
4. Write the result to a `.csv` file

E.g., the first 2 lines of `instagram-details.csv`:
```
instagram,follower_curr,follower_mean,follower_med,follower_std,follower_min,follower_max,following_curr,following_mean,following_med,following_std,following_min,following_max,posts_curr,posts_mean,posts_med,posts_std,posts_min,posts_max,name,id,admission_rate,sat_score,cost_attendance,net_price_private,net_price_public,income_avg,income_med,size,lat,lon,carnegie_size_setting,carnegie_undergrad,city,locale,ownership,region,state,zip,locale_type,locale_size
amherstcollege,17700.0,15826.912928759895,15600.0,934.7289567341286,14200.0,17700.0,650.0,636.0343007915567,636.0,12.17145525352865,617.0,655.0,2043.0,1964.8416886543537,1960.0,46.55475530582614,1886.0,2043.0,Amherst College,164465.0,0.12810000000000002,1449.0,71300.0,25208.0,,78988.0,42053.0,1855.0,42.372459,-72.518493,11.0,14.0,Amherst,21.0,private non-profit,new england,MA,01002-5000,suburb,large
```
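The summary-and-merge steps might look something like the sketch below. The `summarize` helper and the `HANDLE_TO_NAME` mapping are hypothetical, and the toy scorecard frame carries only a `name` and `state` column to keep the example short. The sketch assumes that `*_curr` is the most recent observation and that `*_std` is the sample standard deviation (the pandas default), which may differ from the notebook.

```python
import pandas as pd

# Hypothetical handle-to-name mapping with a single entry
HANDLE_TO_NAME = {"cornelluniversity": "Cornell University"}
STAT_PREFIX = {"followers": "follower", "following": "following", "posts": "posts"}


def summarize(handle, df):
    """Collapse one account's time series into a single summary row."""
    df = df.sort_values("date")
    row = {"instagram": handle}
    for col, prefix in STAT_PREFIX.items():
        series = df[col]
        row[f"{prefix}_curr"] = series.iloc[-1]  # latest observation
        row[f"{prefix}_mean"] = series.mean()
        row[f"{prefix}_med"] = series.median()
        row[f"{prefix}_std"] = series.std()
        row[f"{prefix}_min"] = series.min()
        row[f"{prefix}_max"] = series.max()
    return row


# Rows taken from the cornelluniversity.csv excerpt
cornell = pd.DataFrame({
    "date": pd.to_datetime(["2019-09-08", "2019-09-09", "2019-09-10"]),
    "followers": [190500, 190700, 190900],
    "following": [160, 160, 160],
    "posts": [1773, 1774, 1775],
})
summary = pd.DataFrame([summarize("cornelluniversity", cornell)])
summary["name"] = summary["instagram"].map(HANDLE_TO_NAME)

# Toy stand-in for the cleaned scorecard dataframe
scorecard = pd.DataFrame({
    "name": ["Cornell University", "Millikin University"],
    "state": ["NY", "IL"],
})

# Left join: keep every Instagram account, attach scorecard columns by name
merged = summary.merge(scorecard, how="left", on="name")
print(merged.T)
```

Because the join is a left join from the Instagram side, scorecard rows without a tracked account (Millikin University here) do not appear in the result.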

## Data Description

Our three main datasets are Instagram Data, Scorecard Data, and Merged Data; the Merged Data combines the other two.

### Instagram Data
Instagram data was collected per university. For every university, there is a table of observations in which each row is a timestamped daily observation from the past year and the columns are the number of followers, following, and posts on the university's Instagram page. Changyuan started this dataset last year as a personal project, and the scraper has been collecting data automatically ever since. The 70 Instagram accounts tracked were chosen by hand, so they are not necessarily representative of all universities. During certain time intervals, data is missing due to rate limiting or unexpected changes to the structure of the scraped website. No preprocessing was done before the data was received and cleaned for this project, and data was collected only on accounts belonging to institutions, never individual people. Raw data is available in the GitHub repository.

### Scorecard Data
Scorecard data contains universities and their attributes. The rows are universities, and the columns are admission rate, SAT score, cost of attendance, average income, median income, size of student population, latitude, longitude, city, name, ownership (private/public), region, state, locale type, and locale size. More attributes are present in the full dataset, which was assembled by the US Department of Education as part of the College Scorecard Project. The College Scorecard Project is designed to increase transparency and help students and families compare postsecondary institutions. The data originates from federal reporting by institutions, data on federal financial aid, and tax information, reported to the Integrated Postsecondary Education Data System (IPEDS). For many elements, data is processed and pooled across multiple years to reduce year-to-year variation in figures. Student-level data comes from the universities themselves or from recipients of federal student aid, who are likely aware that their anonymized statistics will be included as part of the institution's records. The raw College Scorecard data used in this project can be found at https://collegescorecard.ed.gov/data/.

### Merged Data
Merged data is a combination of the Instagram Data and Scorecard Data. It is a smaller dataset, as it only contains the subset of universities for which we collected Instagram data over the past year. All universities that were scraped are present in the Scorecard data. Since Instagram data varies by university, attribute, and time, we had to collapse one of these dimensions to fit the data into a 2D dataframe. As such, the data points for each university over the course of the year were compressed into their latest values and summary statistics. Merged data contains all columns in the Scorecard Data, plus columns for current and summary statistics of followers, following, and post count. It only contains rows/observations for universities that we could match to an Instagram account we collected data for.

## Data Limitations

### Instagram Data
1. We have chunks of missing or corrupt data. We believe the causes are Instagram's rate limiting, our scraper's mishandling of rate-limiting errors, and temporary changes to the structure of Instagram's website. These gaps can be observed in the data exploration graphs. We attempted to mitigate their effect (particularly on range and rate of change) by recently scraping more data and ensuring that the beginning and ending dates had fully populated entries.
2. The Instagram handles were handpicked by Changyuan, so the list is not comprehensive and is not necessarily representative of all types of colleges/universities in the U.S. For example, the list skews toward Colorado schools (where Changyuan is from).
3. Instagram statistics are not as granular as we would have liked: large counts are abbreviated with `k` for thousand and `m` for million. As a result, some stepping can be observed in the graphs (especially for Harvard and other Instagram accounts with high follower counts).
4. Our dataset is 3-dimensional: Instagram handle, statistic, and time. As a result, some information is lost when collapsing it to 2D. In this phase, we chose to collapse statistics and time into one dimension: summary statistics over time. This can be observed in the exploratory data analysis for our merged dataset.
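
The missing chunks described in the first limitation could be located with something like the sketch below, which looks for jumps of more than one day between consecutive observation dates. The `find_gaps` helper is hypothetical and not part of our notebooks. In the sample, the first three timestamps come from the Cornell excerpt and the fourth is made up to create a gap.

```python
import pandas as pd


def find_gaps(df, max_gap=pd.Timedelta(days=1)):
    """Return one row per run of missing days in a daily time series."""
    # Reduce timestamps to calendar days so the scrape time does not matter
    days = pd.to_datetime(df["date"]).dt.normalize().drop_duplicates().sort_values()
    delta = days.diff()
    mask = delta > max_gap
    return pd.DataFrame({
        "gap_start": days.shift(1)[mask],  # last day seen before the gap
        "gap_end": days[mask],             # first day seen after the gap
        "missing_days": delta[mask].dt.days - 1,
    }).reset_index(drop=True)


sample = pd.DataFrame({
    "date": [
        "2019-09-08 15:42:32.026749",
        "2019-09-09 12:00:03.187746",
        "2019-09-10 12:00:02.818769",
        "2019-09-14 12:00:02.000000",  # hypothetical row, not from our data
    ],
})
gaps = find_gaps(sample)
print(gaps)
```

A report like this only finds days with no row at all; rows that exist but hold corrupt values would need a separate check.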

### Scorecard Data
1. This dataset is massive, so it had to be heavily filtered. Many columns that we deemed irrelevant were dropped. We also filtered the dataset down to four-year colleges. This limitation caused us to change our research question to be more specific.

## Exploratory Data Analysis

More detailed information (and more exploratory data analysis) can be found in our [data exploration notebook](phase2-exploration.ipynb).

```
import os
import pandas as pd
import matplotlib.pyplot as plt
```

### Instagram Statistics

```
instagram_dir = "../instagram" # instagram dataset
instagram_files = os.listdir(instagram_dir)

instagram_df = {}
for file in instagram_files:
    df = pd.read_csv(os.path.join(instagram_dir, file))
    df.date = pd.to_datetime(df.date)
    instagram_df[file.split(".")[0]] = df
```

### College/University Information

### Merged Dataset

## Questions for Reviewers

#### Everyone:
1. Is our research question clear/specific enough? Is there significant potential for interesting analyses?
2. Did we miss any potentially interesting routes to explore? What are some other questions you have that might be answered by our data?
3. Do you have any other suggestions for how to improve our project?

#### Project Mentor:
1. Where do we go after our initial exploratory data analysis?
2. What are some examples of preregistration of analyses? What are some examples of final phase analyses?