<a href="https://colab.research.google.com/github/david-j-cox/twitter-higher-ed/blob/main/basic_analyses.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Packages and modules

Coding goes faster when we reuse code and algorithms that others have already written, so we don't have to reinvent the wheel. The next cell imports the pre-existing packages we'll use.

In [2]:
# Packages for path management
import os

# Packages for manipulating data
import numpy as np
from numpy import std, mean, sqrt
import pandas as pd

# Packages for visualizing data
import seaborn as sns 
import matplotlib.pyplot as plt
try:
    import sweetviz as sv
except ImportError:
    # Install sweetviz if it is missing (the ! prefix only works in Jupyter/Colab)
    !pip install sweetviz
    import sweetviz as sv

# Packages for statistics and modeling
from scipy import stats
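try:
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
except ImportError:
    # Install vaderSentiment if it is missing, then retry the import
    !pip install vaderSentiment
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer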
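
The try/except pattern above installs a package only when the import fails. A minimal sketch of the same check in plain Python, which also runs outside a notebook, is shown here. The helper name `is_installed` and the package name `hypothetical_missing_pkg_xyz` are made up for illustration, and the check is only meant for top-level module names.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


# 'hypothetical_missing_pkg_xyz' is a made-up name that should not exist
candidates = ["os", "json", "hypothetical_missing_pkg_xyz"]
missing = [name for name in candidates if not is_installed(name)]
print(missing)
```

Anything that ends up in `missing` would need to be installed before the imports in the cell above can succeed.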

# Read in raw data set

In [3]:
# Set the working directory to the corresponding repository
directory = '/Users/davidjcox/Dropbox (Personal)/Projects/Manuscripts In Progress/Empirical/Endicott Data/diversity-survey/'
os.chdir(directory)

# Read in the data and work on a copy so the raw dataframe stays unmodified
raw_data = pd.read_csv('./data/01_raw/raw_data.csv')
data = raw_data.copy()
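
To see why the copy matters, here is a small sketch on a toy dataframe (the values are illustrative and not taken from the survey file). It assumes the pandas default of a deep copy, so edits to the working copy should not reach the original.

```python
import pandas as pd

# Toy stand-in for the raw survey data (illustrative values only)
raw_toy = pd.DataFrame({"age": ["18-30", "31-45"], "gender": ["Female", "Male"]})

# .copy() makes a deep copy by default, so the two frames do not share data
working = raw_toy.copy()
working.loc[0, "gender"] = "female"

print(raw_toy.loc[0, "gender"])   # original value is unchanged
print(working.loc[0, "gender"])   # only the copy was edited
```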

In [9]:
# Take a quick peek at every 100th row of the dataframe to make sure nothing looks off
data[::100]

Unnamed: 0,Time,Consent,Q1 of 20. What is your age?,Q2 of 20. With which gender do you identify?,Q3 of 20. What pronouns do you prefer?,Q4 of 20. With which ethnicity/race do you identify? Please list all that apply.,"Q5 of 20. With which religion do you identify? Please list all that apply, including if you do not identify with a religion.",Q6 of 20. With which political affiliation do you identify?,"Q7 of 20. What is the highest level of education you have completed? (If currently enrolled, please select highest degree received):",Please specify other terminal degree,...,Q18 of 20. Please rate your familiarity with the following terms: (Judaism),Q18 of 20. Please rate your familiarity with the following terms: (No religion (Atheist)),Q18 of 20. Please rate your familiarity with the following terms: (Paganism),Q18 of 20. Please rate your familiarity with the following terms: (Wiccan),Q19 of 20. What actionable items do you recommend we take as a field to increase diversity?,Q20 of 20. Please include any comments or feedback on how the demographic section of this survey can be improved to meet its goal of inclusivity and diversity.,Browser,IP Address,Unique ID,Location
0,10/21/20 12:30,"Yes, I am at least 18 years of age, am a stude...",46-60,Male,He/him/his,Jewish,,Libertarian,Doctoral degree,,...,Very familiar,Very familiar,Very familiar,Very familiar,"None, treat everyone as an individual and not ...",,Chrome 86.0.4240.75 / Windows,100.12.9.146,682348249,"40.620098114014, -73.751899719238"
100,10/28/20 12:27,"Yes, I am at least 18 years of age, am a stude...",31-45,male,He/him/his,White,,Democratic Socialist/Democratic Party,Master's degree,,...,Very familiar,Very familiar,Somewhat familiar,Somewhat familiar,Invest in recruiting marginalized population...,,Safari 13.1.2 / OS X,74.85.93.162,688774433,"47.614498138428, -122.34799957275"
200,3/9/21 17:30,"Yes, I am at least 18 years of age, am a stude...",31-45,Male,He/him/his,Caucasion,I do not identify with a religion or any spiri...,Democrat/moderate liberal,Bachelor's degree,,...,Somewhat familiar,Very familiar,Very familiar,Somewhat unfamiliar,I think the field of ABA is new enough that ma...,,Chrome 88.0.4324.182 / Windows,205.118.194.18,776160429,"40.498199462891, -111.84359741211"
300,3/9/21 23:45,"Yes, I am at least 18 years of age, am a stude...",18-30,Female,She/her/hers,Mixed White and Black Caribbean,,UK - left wing Green Party/labour,Master's degree,,...,Very familiar,Very familiar,Somewhat familiar,Somewhat familiar,Ensure that everyone gets 'a seat at the table...,It was great that fields were left open so I c...,Mobile Safari 12.1.2 / iOS,94.204.111.63,776275507,"24, 54"
400,3/20/21 19:01,"Yes, I am at least 18 years of age, am a stude...",31-45,Female,She/her/hers,White,,Independent,Master's degree,,...,Very familiar,Very familiar,Very familiar,Very familiar,Looking at socioeconomic background and making...,,Mobile Safari 14.0.3 / iOS,69.207.101.121,781294861,"43.224601745605, -77.592002868652"
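
The `data[::100]` expression is ordinary Python step slicing applied to the rows of the dataframe. A quick sketch on a toy dataframe (the column name `respondent` is made up for illustration) shows which rows that selects:

```python
import pandas as pd

# Toy dataframe with 250 rows and a default 0..249 index
toy = pd.DataFrame({"respondent": range(250)})

# Start at the first row and keep every 100th row after it
peek = toy[::100]
print(peek.index.tolist())  # [0, 100, 200]
```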


## Rename column headers

When programming, we often have to specify the column we're interested in. The current column headers are very wordy, so we'll rename them to make them easier to work with.

In [13]:
# Take a look at the existing column headers
list(data)

['Time',
 'Consent',
 'Q1 of 20. What is your age?',
 'Q2 of 20. With which gender do you identify?',
 'Q3 of 20. What pronouns do you prefer?',
 'Q4 of 20. With which ethnicity/race do you identify? Please list all that apply.',
 'Q5 of 20. With which religion do you identify? Please list all that apply, including if you do not identify with a religion. ',
 'Q6 of 20. With which political affiliation do you identify?',
 'Q7 of 20. What is the highest level of education you have completed? (If currently enrolled, please select highest degree received):',
 'Please specify other terminal degree',
 'Q8 of 20. What is your annual household income?',
 'Q9 of 20. Do you identify as having a disability?',
 'If yes and you feel comfortable sharing, please specify:',
 'Q10 of 20. Please select applicable certification and/or licenses (select all that apply):',
 'Q11 of 20. How long have you identified as being in the field of behavior analysis?',
 'Q12 of 20. How long have you been certified an

In [14]:
# Use that text to create shorthand labels
data.columns = ['time', 
               'consent', 
               'age', 
               'gender', 
               'pronouns', 
               'ethnicity', 
               'religion', 
               'political_affil', 
               'education', 
               'degree_name', 
               'income', 
               'disability', 
               'disability_name', 
               'cert_license', 
               'time_in_bx_anal', 
               'time_cert_license', 
               'country_live', 
               'province_live', 
               'state_live',
               'country_practice', 
               'province_practice', 
               'state_practice',
               'define_diversity', 
               'familiar_cis_fem', 
               'familiar_cis_male', 
               'familiar_trans_fem', 
               'familiar_trans_male', 
               'familiar_non_binary', 
               'familiar_gender_fluid', 
               'familiar_gender_neutral', 
               'familiar_asian', 
               'familiar_black', 
               'familiar_white', 
               'familiar_latinx', 
               'familiar_nat_amer', 
               'familiar_pac_isl', 
               'familiar_agnostic', 
               'familiar_amish', 
               'familiar_buddhism', 
               'familiar_christian', 
               'familiar_hinduism', 
               'familiar_islam', 
               'familiar_jehovah', 
               'familiar_judaism', 
               'familiar_atheist', 
               'familiar_paganism', 
               'familiar_wiccan', 
               'actions_recommend', 
               'survey_feedback', 
               'browser',
               'ip_address',
               'unique_id',
               'location']
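
Assigning a list to `data.columns` relies on the new names being in exactly the same order, and of the same length, as the existing headers. One possible alternative, sketched here on a two-column toy dataframe, is to build an explicit old-to-new mapping and pass it to `rename`. This is only an illustration of the idea, not the approach used in this notebook.

```python
import pandas as pd

# Toy dataframe using two of the original survey headers
toy = pd.DataFrame({
    "Q1 of 20. What is your age?": ["46-60"],
    "Q2 of 20. With which gender do you identify?": ["Male"],
})

new_names = ["age", "gender"]

# Fail early if the number of labels does not match the number of columns
assert len(new_names) == len(toy.columns)

# Pair each old header with its shorthand label, then rename
mapping = dict(zip(toy.columns, new_names))
renamed = toy.rename(columns=mapping)

print(list(renamed.columns))
```

Printing `mapping` before renaming makes it easy to eyeball that every question landed on the intended label.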

In [17]:
# Save it for future quick use
data.to_csv('./data/03_primary/clean_headers.csv')
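
One thing to keep in mind when saving: by default `to_csv` also writes the row index, and reading that file back produces an extra `Unnamed: 0` column like the one visible in the preview earlier. The sketch below shows the difference on a toy dataframe, using an in-memory buffer instead of a real file. Whether to pass `index=False` here is a judgment call that depends on how later notebooks read the file.

```python
import io

import pandas as pd

toy = pd.DataFrame({"age": ["46-60", "31-45"]})

# to_csv() with no path returns the CSV text, so nothing is written to disk
with_index = pd.read_csv(io.StringIO(toy.to_csv()))
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))     # ['Unnamed: 0', 'age']
print(list(without_index.columns))  # ['age']
```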

## Sweetviz for quick EDA on dataframe

In [20]:
# Run the report
report = sv.analyze(data)

# Show the report
report.show_html(filepath='./figures/sweetviz_report.html', 
                     open_browser=True, 
                     layout='widescreen', 
                     scale=None)

Report ./figures/sweetviz_report.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


# Topic modeling

In [18]:
# Convert the text data we're interested in to lowercase
data['actions_recommend'] = data['actions_recommend'].str.lower()
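
Open-ended survey questions are often left blank, so it helps to know how `.str.lower()` treats missing values. The sketch below uses a toy dataframe with illustrative responses. It assumes that blank answers are stored as missing values, and that dropping them before modeling is acceptable.

```python
import pandas as pd

# Toy responses, including one missing answer (illustrative text only)
toy = pd.DataFrame({"actions_recommend": ["Invest in RECRUITING", None, "Mentorship"]})

# .str.lower() lowercases the text and leaves missing entries as missing
toy["actions_recommend"] = toy["actions_recommend"].str.lower()
n_missing = int(toy["actions_recommend"].isna().sum())

# Keep only the real responses for any later text analysis
docs = toy["actions_recommend"].dropna().tolist()
print(n_missing)
print(docs)
```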