# Creating fake data for analysis

This notebook illustrates the methods behind the dataset we use in our analysis.


Goals:
1. Create a dataset similar in shape and content to that of COMIS (?)
  - one row for each course taken
      - student features
        - ids/ssn
        - demographics (age, sex)
        - student intent (transfer, AA, credential)
        - first time or returning
      - course features
        - course id
        - section id
        - units
        - grade in class
        - college id
        - department (Math, English)
        - dev ed (T/F)
      - term id (Year, Term)
2. Prepare the data in the same form in which we receive it
  - categoricals
  - fall-winter-spring quarter system
3. Prepare a second dataset for __Enrollment__ (10 million rows)
  - columns: term, course name, course code, college id, units, credit/noncredit 
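
As a rough sketch of goal 2, term ids for a fall-winter-spring quarter system can be stored as an ordered pandas categorical. The year range and the helper name `make_term_ids` are assumptions made for illustration; they do not come from the real data.

```python
import numpy as np
import pandas as pd

TERMS = ['fall', 'winter', 'spring']  # quarter order within one academic year

def make_term_ids(n_rows, years=(2013, 2014, 2015), seed=0):
    """Hypothetical helper: draw a (year, term) pair for each row."""
    rng = np.random.default_rng(seed)
    frame = pd.DataFrame({
        'year': rng.choice(years, size=n_rows),
        'term': rng.choice(TERMS, size=n_rows),
    })
    # An ordered categorical keeps fall < winter < spring when sorting.
    frame['term'] = pd.Categorical(frame['term'], categories=TERMS, ordered=True)
    return frame

terms = make_term_ids(1_000)
print(terms.dtypes)
print(terms.sort_values(['year', 'term']).head())
```

Storing `term` as a categorical rather than a plain string also reduces memory use, which matters at 10 million rows.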

***
__Set up__

In [1]:
import pandas as pd
import numpy as np

We will end up with a dataset 10 million rows long, but it should contain only 1 million students (each taking at most 10 classes).

First, the students: <br>
Each student must have 1) an id/ssn, 2) an age, 3) a sex, 4) an intent, and 5) a first-time status.

In [30]:
np.random.seed(41)

intent = np.random.choice(a = ['transfer', 'AA', 'credential'], size = 1_000_000, p = [.6, .2, .2])

sex = np.random.choice(a = ['male', 'female'], size = 1_000_000, p = [.49, .51])
first_time = np.random.choice(a = ['first', 'returning'], size = 1_000_000, p = [.3, .7])

# `high` is exclusive, so this draws from 899,999 possible ids
# drawing 1,000,000 times with replacement repeats many of them, so far fewer than one million ids are unique
# that is ok though: the repeated ids are dropped later
ids = np.random.randint(low = 100_000, high = 999_999, size = 1_000_000)

age = np.random.randint(low = 18, high = 65, size = 1_000_000,)

race_ethnicity = np.random.choice(a = ['asian', 'white', 'latino', 'black', 'other'], size = 1_000_000, p = [0.15, 0.35, 0.38, 0.10, 0.02])

full_part = np.random.choice(a = ['full-time', 'part-time'], size = 1_000_000, p = [0.3158, 0.6842]) # 2013 stats according to IPEDS 

In [31]:
data = pd.DataFrame(
    data = 
    {
        'ids': ids, 
        'sex': sex, 
        'age': age, 
        'first_time_status': first_time, 
        'intent': intent, 
        'race': race_ethnicity,
        'full_time': full_part,
    }, 
)

data.head()

Unnamed: 0,ids,sex,age,first_time_status,intent,race,full_time
0,677074,female,82,first,transfer,latino,full-time
1,344008,female,18,first,transfer,white,full-time
2,165437,male,52,returning,AA,latino,part-time
3,501411,male,21,returning,transfer,latino,part-time
4,983007,male,63,returning,transfer,black,part-time


In [4]:
print(f"Only {data['ids'].nunique() / len(data['ids']):.2%} of ids are unique. Meaning, we have {data['ids'].nunique():,.0f} students in total.")

Only 60.36% of ids are unique. Meaning, we have 603,584 students in total.
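
The roughly 60% figure is what drawing with replacement predicts. With `N` possible ids and `n` draws, the expected number of distinct ids is `N * (1 - (1 - 1/N)**n)`. A quick check, assuming the same id range and sample size as the cell that generated `ids`:

```python
N = 999_999 - 100_000   # number of possible ids (randint's `high` is exclusive)
n = 1_000_000           # number of draws

# Each id is missed by a single draw with probability (1 - 1/N),
# so it is missed by all n draws with probability (1 - 1/N)**n.
expected_unique = N * (1 - (1 - 1 / N) ** n)
print(f"{expected_unique:,.0f} expected unique ids ({expected_unique / n:.1%} of draws)")
```

The result should sit close to the 603,584 unique ids observed above.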


We then need to remove the repeated IDs (de-duplicate them).

In [5]:
data.drop_duplicates(subset = 'ids', keep = 'first', inplace = True)
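
If distinct ids are wanted from the start, one alternative (not what this notebook does) is to sample without replacement, which makes the de-duplication step unnecessary. The sketch below uses a small sample for speed, and `make_unique_ids` is a hypothetical helper. The id range must hold at least as many values as there are students, so one million students would need a wider range than the one used here.

```python
import numpy as np

def make_unique_ids(n_students, low=100_000, high=999_999, seed=41):
    """Hypothetical helper: draw n_students distinct ids from [low, high)."""
    if n_students > high - low:
        raise ValueError("id range is too small for the requested number of students")
    rng = np.random.default_rng(seed)
    # replace=False guarantees that no id is drawn twice.
    return rng.choice(np.arange(low, high), size=n_students, replace=False)

unique_ids = make_unique_ids(10_000)
print(len(unique_ids), len(np.unique(unique_ids)))
```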

### TODO:
1. Create the course data.
2. Merge both datasets into one big ~10 million row dataset.

IDEA:
Create a 10 million row series by sampling from `data['ids'].unique()`, then `pd.merge(10mil_ids, data)`, and then add the course features.

In [6]:
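# sample 10 million ids (with replacement) from the de-duplicated students
big_data = np.random.choice(a = data['ids'].unique(), size = 10_000_000)

big_data = pd.DataFrame(big_data)
big_data.columns = ['ids']

# attach each student's features to every row that carries their id
big_data = pd.merge(big_data, data, how = 'left', on = 'ids')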

print(big_data.shape)
print(big_data.head())

(10000000, 6)
      ids     sex  age first_time_status      intent    race
0  128781  female   30         returning    transfer  latino
1  202009  female   51             first    transfer  latino
2  103152  female   81             first  credential   white
3  356309  female   62             first    transfer   white
4  244031  female   62             first          AA  latino
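
Sampling ids uniformly does not by itself guarantee that each student takes at most 10 classes. One possible way to enforce the cap, sketched here on a toy table, is to draw a class count per student and repeat each student's row that many times. The uniform 1 to 10 class-count distribution and the helper name `expand_to_enrollments` are assumptions.

```python
import numpy as np
import pandas as pd

def expand_to_enrollments(students, max_classes=10, seed=0):
    """Hypothetical helper: one row per (student, class), at most max_classes each."""
    rng = np.random.default_rng(seed)
    # Number of classes per student, drawn uniformly from 1..max_classes.
    n_classes = rng.integers(low=1, high=max_classes + 1, size=len(students))
    # Repeat each student's index label once per class taken.
    repeated = students.loc[students.index.repeat(n_classes)]
    return repeated.reset_index(drop=True)

students = pd.DataFrame({'ids': [101, 102, 103], 'sex': ['female', 'male', 'female']})
enrollments = expand_to_enrollments(students)
print(enrollments['ids'].value_counts())
```

With this approach the total row count depends on the drawn class counts rather than being fixed at 10 million.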


### Adding California Community Colleges (CCC)

Source for this table: https://en.wikipedia.org/wiki/List_of_California_Community_Colleges_by_enrollment

In [37]:
ccc = pd.read_csv("../data/raw/list_CCC.csv")

ccc.head()

Unnamed: 0,Ranking,College,Total enrollment,Full-time enrollment,Part-time enrollment
0,1,East Los Angeles College,36606,7090,29516
1,2,Santa Monica College,29999,10720,19279
2,3,American River College,29701,7560,22141
3,4,Santa Ana College,28598,3435,25163
4,5,Mount San Antonio College,28481,10499,17982


In [39]:
ccc['share_of_pop'] = ccc['Total enrollment'] / ccc['Total enrollment'].sum()

ccc.head()

Unnamed: 0,Ranking,College,Total enrollment,Full-time enrollment,Part-time enrollment,share_of_pop
0,1,East Los Angeles College,36606,7090,29516,0.024851
1,2,Santa Monica College,29999,10720,19279,0.020366
2,3,American River College,29701,7560,22141,0.020164
3,4,Santa Ana College,28598,3435,25163,0.019415
4,5,Mount San Antonio College,28481,10499,17982,0.019335


In [42]:
college = np.random.choice(a = ccc['College'], size = 10_000_000, p = list(ccc['share_of_pop']))

big_data['college'] = college

big_data.head()

Unnamed: 0,ids,sex,age,first_time_status,intent,race,college
0,128781,female,30,returning,transfer,latino,Mt. San Jacinto College
1,202009,female,51,first,transfer,latino,Santa Monica College
2,103152,female,81,first,credential,white,Laney College
3,356309,female,62,first,transfer,white,Palomar College
4,244031,female,62,first,AA,latino,Riverside City College
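
As a sanity check on the weighted draw, the empirical share of each college should land close to its `share_of_pop`. The sketch below rebuilds a toy table from only the five colleges shown in the `ccc.head()` output, so its shares differ from those of the full table; the sample size and seed are arbitrary choices.

```python
import numpy as np
import pandas as pd

toy_ccc = pd.DataFrame({
    'College': ['East Los Angeles College', 'Santa Monica College',
                'American River College', 'Santa Ana College',
                'Mount San Antonio College'],
    'Total enrollment': [36606, 29999, 29701, 28598, 28481],
})
toy_ccc['share_of_pop'] = toy_ccc['Total enrollment'] / toy_ccc['Total enrollment'].sum()

rng = np.random.default_rng(0)
sampled = rng.choice(toy_ccc['College'], size=200_000, p=toy_ccc['share_of_pop'])

# Compare the target shares with the shares observed in the sample.
observed = pd.Series(sampled).value_counts(normalize=True)
check = toy_ccc.set_index('College')['share_of_pop'].to_frame('target')
check['observed'] = observed
print(check)
```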


In [None]:
np.random.normal(loc = 30, scale = 15, size = 10)