# Creating fake data for analysis

This notebook illustrates the methods behind the dataset we use in our analysis.


Goals:
1. Create a dataset similar in shape and content to that of COMIS (?)
  - one row for each course taken
      - student features
        - ids/ssn
        - demographics (age, sex)
        - student intent (transfer, AA, credential)
        - first time or returning
      - course features
        - course id
        - section id
        - units
        - grade in class
        - college id
        - department (Math, English)
        - dev ed (T/F)
      - term id (Year, Term)
2. Prepare the data in the same form as the data we receive
  - categoricals
  - fall-winter-spring-quarter system
3. Prepare a second dataset for __Enrollment__ (10 million rows)
  - columns: term, course name, course code, college id, units, credit/noncredit 
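
One way goal 2 might look in pandas is an ordered categorical for the quarter terms. This is only a sketch: the sample values and the ordering fall < winter < spring (academic-year order) are assumptions, not something the real data is known to use.

```python
import pandas as pd

# Hypothetical term labels; the real term coding may differ.
terms = pd.Series(["spring", "fall", "winter", "fall"])

# Assumed academic-year ordering: fall < winter < spring.
term_dtype = pd.CategoricalDtype(categories=["fall", "winter", "spring"], ordered=True)
terms_cat = terms.astype(term_dtype)

# Sorting an ordered categorical follows the category order, not the alphabet.
print(terms_cat.sort_values().tolist())
# ['fall', 'fall', 'winter', 'spring']
```

Storing terms this way keeps memory use low on a 10-million-row table and makes term comparisons follow the calendar.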

***
__Set up__

In [1]:
import pandas as pd
import numpy as np

The final dataset will be 10 million rows long but should contain only about 1 million students, so each student takes about 10 classes on average.

First, the students: <br>
Each student needs 1) an id/ssn, 2) an age, 3) a sex, 4) an intent, and 5) a first-time status. The cell below also draws a race/ethnicity for each student.

In [2]:
np.random.seed(41)

intent = np.random.choice(a = ['transfer', 'AA', 'credential'], size = 1_000_000, p = [.6, .2, .2])

sex = np.random.choice(a = ['male', 'female'], size = 1_000_000, p = [.49, .51])
first_time = np.random.choice(a = ['first', 'returning'], size = 1_000_000, p = [.3, .7])

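# randint's `high` is exclusive, so only 899,999 distinct six-digit ids are possible;
# one million draws with replacement therefore repeat many ids (see the uniqueness check below)
# that is ok for now, but the repeated ids matter later when merging on 'ids'
ids = np.random.randint(low = 100_000, high = 999_999, size = 1_000_000)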

age = np.random.randint(low = 18, high = 99, size = 1_000_000,)

race_ethnicity = np.random.choice(a = ['asian', 'white', 'latino', 'black', 'other'], size = 1_000_000, p = [0.15, 0.35, 0.38, 0.10, 0.02])

In [3]:
data = pd.DataFrame(data = {'ids': ids, 'sex': sex, 'age': age, 'first_time_status': first_time, 'intent': intent, 'race':race_ethnicity}, )

data.head()

| | ids | sex | age | first_time_status | intent | race |
|---|---|---|---|---|---|---|
| 0 | 677074 | female | 82 | first | transfer | latino |
| 1 | 344008 | female | 18 | first | transfer | white |
| 2 | 165437 | male | 52 | returning | AA | latino |
| 3 | 501411 | male | 21 | returning | transfer | latino |
| 4 | 983007 | male | 63 | returning | transfer | black |


In [4]:
print(f"Only {data['ids'].nunique() / len(data['ids']):.2%} of ids are unique. Meaning, we have {data['ids'].nunique():,.0f} students in total.")

Only 60.36% of ids are unique. Meaning, we have 603,584 students in total.
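
The 60% figure is what sampling with replacement predicts: drawing n = 1,000,000 ids from N = 899,999 possible values gives about N(1 - e^(-n/N)) ≈ 604,000 distinct ids. If exactly one id per student were wanted instead, sampling without replacement would be one option. The sketch below assumes a 7-digit id range and a small `n_students` for illustration; neither comes from the notebook.

```python
import numpy as np

rng = np.random.default_rng(41)  # seed mirrors the notebook; any seed works

n_students = 1_000  # small for illustration; the notebook uses 1_000_000

# Assumed 7-digit id range: it must contain at least n_students values.
low, high = 1_000_000, 10_000_000

# replace=False guarantees that no id is drawn twice.
unique_ids = low + rng.choice(high - low, size=n_students, replace=False)

print(len(np.unique(unique_ids)) == n_students)
# True
```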


### TODO:
1. Create the course data.
2. Merge both datasets into one large dataset of ~10 million rows.

IDEA:
Create a 10-million-row series by sampling from `data['ids'].unique()`, then `pd.merge(ids_10mil, data)`, then add course features.
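
The "add course features" step could look roughly like the sketch below, which draws one value per row for each course feature listed in the goals. Every range, category, and probability here is a placeholder assumption (number of courses, grade letters, college ids, years), and `n_rows` is kept small for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(41)
n_rows = 1_000  # stand-in for the 10_000_000 rows of the full dataset

# All ranges and categories below are hypothetical placeholders.
courses = pd.DataFrame({
    "course_id": rng.integers(1, 500, size=n_rows),
    "section_id": rng.integers(1, 10, size=n_rows),
    "units": rng.choice([1, 2, 3, 4, 5], size=n_rows),
    "grade": rng.choice(["A", "B", "C", "D", "F", "W"], size=n_rows),
    "college_id": rng.integers(1, 116, size=n_rows),
    "department": rng.choice(["Math", "English"], size=n_rows),
    "dev_ed": rng.choice([True, False], size=n_rows, p=[0.2, 0.8]),
    "year": rng.integers(2010, 2020, size=n_rows),
    "term": rng.choice(["fall", "winter", "spring"], size=n_rows),
})

print(courses.shape)
# (1000, 9)
```

Because these columns have the same length as the sampled ids, they can be attached column-wise (for example with `pd.concat(..., axis=1)`) rather than through another merge.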

In [17]:
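# sample 10 million ids (with replacement) from the distinct student ids
big_data = np.random.choice(a = data['ids'].unique(), size = 10_000_000)

big_data = pd.DataFrame(big_data)
big_data.columns = ['ids']

# `data` still holds several rows for each repeated id, so this left merge returns
# one output row per matching student row and can exceed 10 million rows
big_data = pd.merge(big_data, data, how = 'left')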

print(big_data.shape)
print(big_data.head())

(16569400, 6)
      ids     sex  age first_time_status    intent    race
0  705086  female   77         returning  transfer  latino
1  705086    male   69         returning  transfer   black
2  705086    male   57         returning  transfer  latino
3  362145    male   81             first  transfer   black
4  362145    male   71         returning  transfer   white
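
The merged frame has 16,569,400 rows rather than 10,000,000 because `data` contains repeated ids: each distinct id appears on average 1,000,000 / 603,584 ≈ 1.66 times, and the left merge emits one row per match. The head shows the effect, with id 705086 paired with three different students. A possible fix, sketched on a toy table with hypothetical values, is to keep one row per id before merging and let pandas check the key relationship.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy student table with deliberately repeated ids, mimicking `data`.
students = pd.DataFrame({
    "ids": [101, 101, 102, 103, 103, 103],
    "age": [20, 35, 19, 44, 27, 61],
})

# Keep one row per id so that each id maps to exactly one student.
students_unique = students.drop_duplicates(subset="ids", keep="first")

# One row per course taken: sample ids with replacement.
enrollments = pd.DataFrame(
    {"ids": rng.choice(students_unique["ids"].to_numpy(), size=12)}
)

# validate="many_to_one" raises MergeError if the right table has duplicate keys.
merged = enrollments.merge(
    students_unique, on="ids", how="left", validate="many_to_one"
)

print(merged.shape)
# (12, 2)
```

With the duplicates removed, the row count of the merged frame equals the number of sampled ids, which is the behavior the 10-million-row target relies on.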
