# Data Conversion
The `STACIA_QA` directory contains Question/Answer (QA) data.

The goal of this notebook is to read all of the pipe-delimited data files in that directory into a `pandas.DataFrame`.

Then the data is reformatted to fit the CSAI style.

Finally, the newly structured data is written out as a CSV file.

In [1]:
from os import listdir, getcwd
from os.path import join
from RedirectStdStreams import RedirectStdStreams
from io import StringIO
import pandas as pd
import re
import numpy as np
import pandas_profiling

In [2]:
DATA_DIR = 'STACIA_QA'
DIR = join(getcwd(), DATA_DIR)

In [3]:
files = listdir(DATA_DIR)

In [4]:
def startswith(y):
    return lambda x: x.startswith(y)

In [5]:
cs_files = sorted(filter(startswith('cs'), files))
stat_files = sorted(filter(startswith('stat'), files))
club_files = sorted(filter(startswith('club'), files))
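
Plain `sorted` orders the names lexicographically, which is why `cs10.txt` is processed before `cs2.txt` in the log further down. Row order does not matter for the later steps, but if numeric order is wanted, a natural-sort key is one option. This is a minimal sketch: `natural_key` is a hypothetical helper that is not part of the notebook, and the file names are a made-up sample.

```python
import re

def natural_key(name):
    """Split a name into text and integer chunks so 'cs2' sorts before 'cs10'."""
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r'(\d+)', name)]

sample_files = ['cs1.txt', 'cs10.txt', 'cs2.txt', 'cs9.txt']  # made-up sample
ordered = sorted(sample_files, key=natural_key)
print(ordered)
```

With this key, `sorted(filter(startswith('cs'), files), key=natural_key)` would be a drop-in replacement for the call above.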

In [6]:
def make_data_frame_from_files(files):
    '''given stacia_qa csv files, return a pandas DataFrame containing the enclosed data'''
    # encodings: https://stackoverflow.com/q/19699367/5411712
    # list-of-encodings:
    # https://docs.python.org/3/library/codecs.html#standard-encodings
    frames = []
    for fname in files:
        fpath = join(DIR, fname)
        frames.append(
            pd.read_csv(fpath,
                        sep='|',
                        encoding='latin_1',
                        header=None,
                        # fun fact: setting the `names` param avoids warnings
                        # because then pandas knows how many columns to make
                        # also the purpose of the "other" columns is to catch weird
                        # extras that would otherwise be trimmed off after the delimiter
                        names=['id','q_format','a_format','notes','other','other2','other3'],
                        # skip malformed lines but warn about them; `on_bad_lines`
                        # (pandas >= 1.3) replaces `error_bad_lines`/`warn_bad_lines`,
                        # which were removed in pandas 2.0
                        on_bad_lines='warn'))
        print(fname, "done", "....on to the next one....")
    # concatenate once at the end instead of re-copying the frame on every iteration
    return pd.concat(frames) if frames else pd.DataFrame()
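
To see what the extra `names` columns do, it can help to parse a tiny in-memory sample. The sketch below uses two made-up lines in the assumed `id|question|answer` layout; the second one carries a stray `|` inside its question, so it produces four fields instead of three.

```python
from io import StringIO
import pandas as pd

# Two made-up lines in the assumed STACIA_QA layout: id|question|answer.
# The second line has a stray "|" inside its question, so it yields 4 fields.
raw = (
    "F2|What is [COURSE] about?|[COURSE] is about [COURSE_DESCRIPTION]\n"
    "G1|Do I have any Free Electives as [Major|Minor]?|Yes or No\n"
)
columns = ['id', 'q_format', 'a_format', 'notes', 'other', 'other2', 'other3']
demo_df = pd.read_csv(StringIO(raw), sep='|', header=None, names=columns)

# The clean row leaves the overflow columns empty (NaN) ...
print(demo_df.loc[0, ['a_format', 'notes']].tolist())
# ... while the malformed row spills its real answer into `notes`.
print(demo_df.loc[1, ['q_format', 'a_format', 'notes']].tolist())
```

Any row with a non-null value in `notes` or the `other` columns is therefore a candidate for a mis-split line.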

In [7]:
cs_df   = make_data_frame_from_files(cs_files)
stat_df = make_data_frame_from_files(stat_files)
club_df = make_data_frame_from_files(club_files)

cs1.txt done ....on to the next one....
cs10.txt done ....on to the next one....
cs11.txt done ....on to the next one....
cs12.txt done ....on to the next one....
cs13.txt done ....on to the next one....
cs14.txt done ....on to the next one....
cs15.txt done ....on to the next one....
cs2.txt done ....on to the next one....
cs3.txt done ....on to the next one....
cs4.txt done ....on to the next one....
cs5.txt done ....on to the next one....
cs6.txt done ....on to the next one....
cs7.txt done ....on to the next one....
cs8.txt done ....on to the next one....
cs9.txt done ....on to the next one....
stat1.txt done ....on to the next one....
clubs.txt done ....on to the next one....


In [8]:
# remove the comments at the end of each file
# (`na=False` keeps rows whose id is missing, same as the `!= True` comparison)
cs_df = cs_df[~cs_df['id'].str.startswith("[", na=False)]
stat_df = stat_df[~stat_df['id'].str.startswith("[", na=False)]
club_df = club_df[~club_df['id'].str.startswith("[", na=False)]

all_df = pd.concat([cs_df, stat_df, club_df])

## Let's have a quick look at `all_df`

In [9]:
print('all_df.shape:', all_df.shape)
print('So there are {} question/answer pairs'.format(all_df.shape[0]))
all_df.tail()

all_df.shape: (2202, 7)
So there are 2202 question/answer pairs


Unnamed: 0,id,q_format,a_format,notes,other,other2,other3
87,A2,What time does [CSCorSTAT] tutoring meet?,"[TutorTime] (if CSC, also insert note on how t...",,,,
88,A2,Who are some private tutors for Statistics?,Here is a PDF for a list of private Statistics...,,,,
89,A2,What do I need to do to be a CSC tutor?,Complete (or be enrolled in) CSC 357 and sched...,,,,
90,A2,Does [CSSESTATClubOrgName] have a [OfficerRole...,[YesOrNo],,,,
91,A2,Who is the club adviser for [CSSESTATClubOrgNa...,The club adviser for [CSSESTATClubOrgName] is ...,,,,


## Is there bad data?  `yes` 😢

In [10]:
def make_weird_columns_mask(df):
    '''given a pandas DataFrame of STACIA_QA data, return a mask to find the rows with extra columns'''
    weird_columns_mask = df['notes'].notnull()
    weird_columns_mask |= df['other'].notnull()
    weird_columns_mask |= df['other2'].notnull()
    weird_columns_mask |= df['other3'].notnull()
    return weird_columns_mask
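
The same mask can also be written without the chain of `|=` updates by asking whether any overflow column is filled. This is an equivalent sketch on a made-up two-row frame; `make_mask_with_any` and `EXTRA_COLUMNS` are hypothetical names, not part of the notebook.

```python
import pandas as pd

EXTRA_COLUMNS = ['notes', 'other', 'other2', 'other3']

def make_mask_with_any(df):
    """Return True for rows where at least one overflow column is filled."""
    return df[EXTRA_COLUMNS].notnull().any(axis=1)

# Made-up sample: one clean row and one row that spilled into `notes`.
toy_df = pd.DataFrame({
    'id': ['F2', 'G1'],
    'q_format': ['What is [COURSE] about?', 'Do I have any Free Electives as [Major'],
    'a_format': ['[COURSE] is about [COURSE_DESCRIPTION]', 'Minor]?'],
    'notes': [None, 'Yes or No'],
    'other': [None, None],
    'other2': [None, None],
    'other3': [None, None],
})
mask = make_mask_with_any(toy_df)
print(mask.tolist())
```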

In [11]:
wierd_columns_mask = make_weird_columns_mask(all_df)
needs_to_be_fixed = all_df[wierd_columns_mask]
print('needs_to_be_fixed.shape:', needs_to_be_fixed.shape)
print('So there are {} rows that do not follow the standard STACIA_QA format'.format(needs_to_be_fixed.shape[0]))

needs_to_be_fixed.shape: (84, 7)
So there are 84 rows that do not follow the standard STACIA_QA format


In [12]:
all_df[wierd_columns_mask].head()

Unnamed: 0,id,q_format,a_format,notes,other,other2,other3
30,G1,How many units of tech electives are required ...,Minor]?,Number,,,
32,G1,Do I have any Free Electives as [Major,Minor]?,Yes or No,,,
34,G1,How many units of Support Courses do I need to...,Minor]?,Number,,,
35,G1,How many units of Approved support courses do ...,Minor]?,Number,,,
36,G1,What CSC courses can I take as a [Major,Minor]?,[Courses],,,
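
These rows suggest that the raw lines contain a `|` inside a bracketed placeholder such as `[Major|Minor]`, so the parser treats it as a field separator. That is an inference from the output above, not something the raw files confirm here. Under that assumption, one way to repair such lines before parsing is to split only on separators that sit outside brackets and parentheses. `split_outside_brackets` is a hypothetical helper, sketched with the standard library only:

```python
def split_outside_brackets(line, sep='|'):
    """Split `line` on `sep`, ignoring separators inside [...] or (...)."""
    fields, current, depth = [], [], 0
    for ch in line:
        if ch in '[(':
            depth += 1
        elif ch in '])' and depth > 0:
            depth -= 1
        if ch == sep and depth == 0:
            # separator at top level: close the current field
            fields.append(''.join(current))
            current = []
        else:
            current.append(ch)
    fields.append(''.join(current))
    return fields

# Made-up raw line, reconstructed from the preview above.
sample = "G1|What CSC courses can I take as a [Major|Minor]?|[Courses]"
parts = split_outside_brackets(sample)
print(parts)
```

The sketch assumes brackets are balanced; a line with an unclosed `[` or `(` would swallow every later separator, so such lines would still need manual review.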


In [13]:
all_df[wierd_columns_mask].tail()

Unnamed: 0,id,q_format,a_format,notes,other,other2,other3
106,A1,Is [Person] available as a private tutor for S...,(No.,Yes.),,,
107,A1,Are there any private tutors for [Stat-Course]?,(No.,"Yes. [Person] at [Email],â¦.",,,
108,A1,What is the contact information of private Sta...,(No one tutors for that course.,"The tutors are [Person] at [Email],â¦.",,,
42,A2,Whats [STAT,CSSE] tutoring office hours?,[OfficeHours],,,
43,A2,Can you reserve a tutor for [CSSE,STAT]?,"You can reserve one for STAT, not for CSSE",,,
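
The `â¦` sequences in this output look like a UTF-8 ellipsis (`…`) that was decoded as `latin_1`. If that guess is right, the affected strings can be repaired by reversing the wrong decoding. The sketch below is a minimal, assumption-laden example: `fix_mojibake` is a hypothetical helper, and it leaves a string untouched whenever the round trip fails.

```python
def fix_mojibake(text):
    """Undo a UTF-8-read-as-latin_1 mix-up; return the input unchanged on failure."""
    try:
        return text.encode('latin_1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # not representable in latin_1, or not valid UTF-8 after re-encoding
        return text

# Build a garbled sample the same way the assumed bug would produce it.
garbled = 'Yes. [Person] at [Email],….'.encode('utf-8').decode('latin_1')
repaired = fix_mojibake(garbled)
print(repaired)
```

Applied column by column (for example with `Series.map` on the non-null values), this would only touch strings that are valid UTF-8 once re-encoded.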


## Frankly, with only `84` bad rows out of `2202` (about 3.8%), it might be reasonable to drop them

In [14]:
all_df[~wierd_columns_mask][['id','q_format','a_format']].head()

Unnamed: 0,id,q_format,a_format
0,F2,What is [COURSE] about?,[COURSE] is about [COURSE_DESCRIPTION]
1,F2,What are the Mathematics/Statistics Support El...,The available Mathematics/Statistics Support E...
2,F2,What are the approved technical electives for ...,The available approved technical elective opti...
3,F2,Which courses will count as approved external ...,The available approved external elective optio...
4,F2,Is taking the GRE required for acceptance into...,"No, for [MAJOR] majors, students do not need t..."


In [15]:
all_df[~wierd_columns_mask][['id','q_format','a_format']].tail()

Unnamed: 0,id,q_format,a_format
87,A2,What time does [CSCorSTAT] tutoring meet?,"[TutorTime] (if CSC, also insert note on how t..."
88,A2,Who are some private tutors for Statistics?,Here is a PDF for a list of private Statistics...
89,A2,What do I need to do to be a CSC tutor?,Complete (or be enrolled in) CSC 357 and sched...
90,A2,Does [CSSESTATClubOrgName] have a [OfficerRole...,[YesOrNo]
91,A2,Who is the club adviser for [CSSESTATClubOrgNa...,The club adviser for [CSSESTATClubOrgName] is ...
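
The remaining step described at the top of the notebook is writing the cleaned rows to a CSV file. The target CSAI column layout is not specified here, so the sketch below simply keeps `id`, `q_format` and `a_format` as they are. It runs on a made-up frame and writes to an in-memory buffer; a real run would pass a file path of your choosing to `to_csv`.

```python
from io import StringIO
import pandas as pd

# Made-up sample: one clean row and one row that spilled into `notes`.
toy_df = pd.DataFrame({
    'id': ['F2', 'G1'],
    'q_format': ['What is [COURSE] about?', 'Do I have any Free Electives as [Major'],
    'a_format': ['[COURSE] is about [COURSE_DESCRIPTION]', 'Minor]?'],
    'notes': [None, 'Yes or No'],
})

# Drop the mis-split rows and keep only the three standard columns.
bad_rows = toy_df['notes'].notnull()
clean_df = toy_df.loc[~bad_rows, ['id', 'q_format', 'a_format']]

buffer = StringIO()  # a real run would pass a file path instead
clean_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```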


## Finally, some reasonable looking data!
### To be continued...

In [16]:
pandas_profiling.ProfileReport(all_df)



