Here I am:
Collecting Dataset Options in Python
Statistical Tests will be done in R

## Option 1: Occupations with the Largest Projected Increase in Jobs by Share of Women in the Occupation

https://www.dol.gov/agencies/wb/data/high-demand-occupations

1. Is 'Employment, 2022' dependent on occupation (can pick 3-4)? (Chi-squared test)
2. Are there more women ('Employment, 2022') working as Nurse practitioners compared to data scientists? (proportion hypothesis testing, t-test and/or z-test)
3. Are there more women ('Employment, 2022') working as Software developers compared to data scientists? (proportion hypothesis testing, t-test and/or z-test)
4. Are there more women projected ('Employment, 2032') to work as Nurse practitioners compared to data scientists? (proportion hypothesis testing, t-test and/or z-test)
5. Are there more women projected ('Employment, 2032') to work as Software developers compared to data scientists? (proportion hypothesis testing, t-test and/or z-test)
6. Is median annual wage dependent on occupation? (Chi-squared test)
7. Compare median annual wage between two occupations. (proportion hypothesis testing)
8. Label coding careers vs. non-coding careers. Do women with coding careers have a higher median wage compared to non-coding careers? (means hypothesis testing, t-test or z-test)
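
For questions 2 to 5, a two-proportion z-test could look roughly like the sketch below. It assumes the number of women in an occupation is estimated as employment (in thousands) times 'Percent women 2022'; the counts passed in at the bottom are made-up placeholders, not values from the DOL file, and `two_proportion_ztest` is a hypothetical helper name. The final tests are still meant to be run in R, so this is only a quick sanity check on the Python side.

```python
import math

def two_proportion_ztest(x1, n1, x2, n2):
    """Return (z, two-sided p-value) for H0: p1 == p2, using the pooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution: 2 * (1 - Phi(|z|))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Placeholder counts (hypothetical, NOT taken from the dataset):
# x = number of women, n = total employed in the occupation.
z, p = two_proportion_ztest(x1=90, n1=100, x2=30, n2=100)
print(f"z = {z:.3f}, p = {p:.3g}")
```

For a one-sided question such as "are there more women in occupation A than in B", the one-sided p-value would be half of the two-sided one when z has the expected sign.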

In [39]:
import pandas as pd

women = pd.read_csv('OccupationswithMostProjectedGrowth.csv')
print(women.head())
print(women['Measure Names'].unique())
print(women['Occupation'].unique())

# 'Employment change, 2022-32' --> frequency (in thousands)
# 'Employment, 2022' --> frequency (in thousands)
# 'Employment, 2032' --> frequency (in thousands)
# 'Median annual wage 2022' --> $
# 'Percent employment change, 2022-32' --> percentage
# 'Percent women 2022' --> percentage

                        Measure Names                  Occupation  \
0          Employment change, 2022-32    Accountants and auditors   
1                    Employment, 2022    Accountants and auditors   
2                    Employment, 2032    Accountants and auditors   
3             Median annual wage 2022    Accountants and auditors   
4  Percent employment change, 2022-32    Accountants and auditors   

  Measure Values  
0           67.4  
1        1,538.4  
2        1,605.8  
3         78,000  
4            4.4  
['Employment change, 2022-32' 'Employment, 2022' 'Employment, 2032'
 'Median annual wage 2022' 'Percent employment change, 2022-32'
 'Percent women 2022']
['  Accountants and auditors' '  Animal caretakers'
 '  Computer and information systems managers'
 '  Computer systems analysts' '  Construction laborers'
 '  Cooks, restaurant' '  Data scientists' '  Financial managers'
 '  First-line supervisors of food preparation and serving workers'
 '  General and operations
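
The output above suggests two cleanup steps before any testing: the occupation names carry leading spaces, and 'Measure Values' is stored as text with thousands separators (for example '1,538.4'). A minimal sketch of that cleanup, plus a reshape from one row per measure to one row per occupation, is below. The small inline CSV is a stand-in built from the `head()` output, and `to_wide` is a hypothetical helper name.

```python
import io
import pandas as pd

# Stand-in for the real CSV, using the five rows printed by women.head().
sample_csv = io.StringIO('''Measure Names,Occupation,Measure Values
"Employment change, 2022-32","  Accountants and auditors",67.4
"Employment, 2022","  Accountants and auditors","1,538.4"
"Employment, 2032","  Accountants and auditors","1,605.8"
Median annual wage 2022,"  Accountants and auditors","78,000"
"Percent employment change, 2022-32","  Accountants and auditors",4.4
''')

def to_wide(long_df):
    """Strip occupation names, parse values as numbers, and pivot to one row per occupation."""
    tidy = long_df.copy()
    tidy['Occupation'] = tidy['Occupation'].str.strip()
    # Remove thousands separators; anything that still fails to parse becomes NaN.
    tidy['Measure Values'] = pd.to_numeric(
        tidy['Measure Values'].astype(str).str.replace(',', '', regex=False),
        errors='coerce',
    )
    return tidy.pivot(index='Occupation', columns='Measure Names', values='Measure Values')

women_wide = to_wide(pd.read_csv(sample_csv))
print(women_wide)
```

With the real file, `to_wide(women)` should give one numeric column per measure, which is easier to export to R.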

## Option 2: NYC Air Quality Dataset

https://data.cityofnewyork.us/Environment/Air-Quality/c3uy-2p5r

1. Is the amount of nitrogen dioxide dependent on NYC borough? (Chi-squared test)
2. Is the amount of fine particles dependent on NYC borough? (Chi-squared test)
3. Is the amount of ozone dependent on NYC borough? (Chi-squared test)
4. Are there more asthma emergency visits in Bronx vs. New York (Manhattan) with more nitrogen dioxide? (mean hypothesis testing and proportion testing, t-test and/or z-test)
5. Are there more asthma hospitalizations due to ozone in Bronx vs. New York (Manhattan)? (mean hypothesis and proportion testing, t-test and/or z-test)

In [20]:
import pandas as pd
from sodapy import Socrata

# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:
client = Socrata("data.cityofnewyork.us", None)

# Example authenticated client (needed for non-public datasets):
# client = Socrata("data.cityofnewyork.us",
#                  MyAppToken,
#                  username="user@example.com",
#                  password="AFakePassword")

# First 2000 results, returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get("c3uy-2p5r", limit=2000)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

#print(results_df.head())
print(results_df['name'].unique())
print(results_df['measure'].unique())






['Nitrogen dioxide (NO2)' 'Fine particles (PM 2.5)' 'Ozone (O3)'
 'Asthma emergency department visits due to PM2.5'
 'Annual vehicle miles traveled' 'Asthma hospitalizations due to Ozone'
 'Respiratory hospitalizations due to PM2.5 (age 20+)'
 'Boiler Emissions- Total SO2 Emissions']
['Mean' 'Estimated annual rate (under age 18)' 'million miles'
 'Estimated annual rate (age 18+)' 'Estimated annual rate'
 'Number per km2']
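
Before comparing areas, each indicator has to be filtered by `name` and its `data_value` converted to a number, since the values returned by the API appear to be strings. The sketch below wraps those steps in a hypothetical helper, `mean_by_place`, and runs it on a tiny made-up frame that only mimics the column names of `results_df`; the numbers in it are placeholders.

```python
import pandas as pd

def mean_by_place(df, indicator, group_col='geo_place_name'):
    """Mean data_value for one indicator, grouped by a geography column."""
    subset = df[df['name'] == indicator].copy()
    # Values that cannot be parsed as numbers become NaN and are skipped by mean().
    subset['data_value'] = pd.to_numeric(subset['data_value'], errors='coerce')
    return subset.groupby(group_col)['data_value'].mean()

# Made-up rows with the same column names as results_df (values are placeholders).
demo = pd.DataFrame({
    'name': ['Nitrogen dioxide (NO2)'] * 3 + ['Ozone (O3)'],
    'geo_place_name': ['East New York', 'East New York',
                       'Pelham - Throgs Neck', 'East New York'],
    'data_value': ['20.00', '30.00', '18.50', '27.58'],
})

no2_means = mean_by_place(demo, 'Nitrogen dioxide (NO2)')
print(no2_means)
```

On the real data this would be called as `mean_by_place(results_df, 'Nitrogen dioxide (NO2)')`. Grouping by borough would need an extra column mapping each place to its borough, which is not shown in the output above.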


In [23]:
no2 = results_df[results_df['name']=='Nitrogen dioxide (NO2)']
print(no2.head())

  unique_id indicator_id                    name measure measure_info  \
0    172653          375  Nitrogen dioxide (NO2)    Mean          ppb   
1    172585          375  Nitrogen dioxide (NO2)    Mean          ppb   
2    336637          375  Nitrogen dioxide (NO2)    Mean          ppb   
3    336622          375  Nitrogen dioxide (NO2)    Mean          ppb   
4    172582          375  Nitrogen dioxide (NO2)    Mean          ppb   

  geo_type_name geo_join_id                      geo_place_name  \
0         UHF34         203  Bedford Stuyvesant - Crown Heights   
1         UHF34         203  Bedford Stuyvesant - Crown Heights   
2         UHF34         204                       East New York   
3         UHF34         103                  Fordham - Bronx Pk   
4         UHF34         104                Pelham - Throgs Neck   

           time_period               start_date data_value  
0  Annual Average 2011  2010-12-01T00:00:00.000      25.30  
1  Annual Average 2009  2008-12-01T0

In [25]:
fineparticle = results_df[results_df['name']=='Fine particles (PM 2.5)']
print(fineparticle.head())

    unique_id indicator_id                     name measure measure_info  \
381    173129          365  Fine particles (PM 2.5)    Mean       mcg/m3   
382    669692          365  Fine particles (PM 2.5)    Mean       mcg/m3   
383    212069          365  Fine particles (PM 2.5)    Mean       mcg/m3   
384    547517          365  Fine particles (PM 2.5)    Mean       mcg/m3   
385    173125          365  Fine particles (PM 2.5)    Mean       mcg/m3   

    geo_type_name geo_join_id                      geo_place_name  \
381         UHF34         203  Bedford Stuyvesant - Crown Heights   
382         UHF34         203  Bedford Stuyvesant - Crown Heights   
383         UHF34         204                       East New York   
384         UHF34         204                       East New York   
385         UHF34         103                  Fordham - Bronx Pk   

             time_period               start_date data_value  
381       Winter 2009-10  2009-12-01T00:00:00.000      10.30  
38

In [26]:
ozone = results_df[results_df['name']=='Ozone (O3)']
print(ozone.head())

    unique_id indicator_id        name measure measure_info geo_type_name  \
576    121693          386  Ozone (O3)    Mean          ppb            CD   
577    549453          386  Ozone (O3)    Mean          ppb            CD   
578    605356          386  Ozone (O3)    Mean          ppb            CD   
582    121649          386  Ozone (O3)    Mean          ppb            CD   
583    605312          386  Ozone (O3)    Mean          ppb            CD   

    geo_join_id                        geo_place_name  \
576         408     Hillcrest and Fresh Meadows (CD8)   
577         408     Hillcrest and Fresh Meadows (CD8)   
578         408     Hillcrest and Fresh Meadows (CD8)   
582         106  Stuyvesant Town and Turtle Bay (CD6)   
583         106  Stuyvesant Town and Turtle Bay (CD6)   

                         time_period               start_date data_value  
576  2-Year Summer Average 2009-2010  2009-06-01T00:00:00.000      27.58  
577                      Summer 2017  2017-0

In [27]:
ervisits = results_df[results_df['name']=='Asthma emergency department visits due to PM2.5']
print(ervisits.head())

#can do proportion hypothesis testing (z-test or t-test)

    unique_id indicator_id                                             name  \
766    518895          648  Asthma emergency department visits due to PM2.5   
767    628444          648  Asthma emergency department visits due to PM2.5   
768    518888          648  Asthma emergency department visits due to PM2.5   
769    518906          648  Asthma emergency department visits due to PM2.5   
770    518926          648  Asthma emergency department visits due to PM2.5   

                                  measure          measure_info geo_type_name  \
766  Estimated annual rate (under age 18)  per 100,000 children         UHF42   
767  Estimated annual rate (under age 18)  per 100,000 children         UHF42   
768  Estimated annual rate (under age 18)  per 100,000 children         UHF42   
769  Estimated annual rate (under age 18)  per 100,000 children         UHF42   
770  Estimated annual rate (under age 18)  per 100,000 children         UHF42   

    geo_join_id           geo_place_na

In [31]:
asthma = results_df[results_df['name']=='Asthma hospitalizations due to Ozone']
print(asthma)

# not usable, only one observation for Bed-Stuy

     unique_id indicator_id                                  name  \
1143    151584          661  Asthma hospitalizations due to Ozone   

                              measure        measure_info geo_type_name  \
1143  Estimated annual rate (age 18+)  per 100,000 adults         UHF42   

     geo_join_id                      geo_place_name time_period  \
1143         203  Bedford Stuyvesant - Crown Heights   2009-2011   

                   start_date data_value  
1143  2009-01-01T00:00:00.000      13.50  


In [32]:
respiratory = results_df[results_df['name']=='Respiratory hospitalizations due to PM2.5 (age 20+)']
print(respiratory.head())

# can do proportion hypothesis testing (z-test or t-test)

     unique_id indicator_id  \
1160    131539          650   
1161    628535          650   
1162    518796          650   
1163    518814          650   
1164    628555          650   

                                                   name  \
1160  Respiratory hospitalizations due to PM2.5 (age...   
1161  Respiratory hospitalizations due to PM2.5 (age...   
1162  Respiratory hospitalizations due to PM2.5 (age...   
1163  Respiratory hospitalizations due to PM2.5 (age...   
1164  Respiratory hospitalizations due to PM2.5 (age...   

                    measure        measure_info geo_type_name geo_join_id  \
1160  Estimated annual rate  per 100,000 adults         UHF42         405   
1161  Estimated annual rate  per 100,000 adults         UHF42         103   
1162  Estimated annual rate  per 100,000 adults         UHF42         105   
1163  Estimated annual rate  per 100,000 adults         UHF42         305   
1164  Estimated annual rate  per 100,000 adults         UHF42         305

## Option 3: Occupations of Mothers vs. Non-mothers

https://www.dol.gov/agencies/wb/data/mothers-families/occupations-largestnumbermothers


In [34]:
mothers = pd.read_csv('Occupations-employing-largest-number-mothers-2021.csv')
print(mothers.head())

                                          Occupation  Total employed  \
0                                  Registered nurses       2146030.0   
1              Elementary and middle school teachers       1872384.0   
2                                     Other managers       1267616.0   
3  Secretaries and administrative assistants, exc...       1146996.0   
4                   Customer service representatives       1188845.0   

   Number of mothers  Number of non-mothers  Percent mothers  \
0          1151877.0               994153.0             53.7   
1          1058116.0               814268.0             56.5   
2           594585.0               673031.0             46.9   
3           566257.0               580739.0             49.4   
4           544073.0               644772.0             45.8   

   Percent non-mothers  Mothers' representation gap (% mothers-%non-mothers)  
0                 46.3                                                7.3     
1                 43.5  

## Option 4: Contraception Dataset

In [16]:
import pandas as pd
import numpy as np

df = pd.read_csv('VIZ5_September_Contraceptive_Use_dataset.csv',  encoding='latin-1')

df['categories'] = df['Pregnancy intention'] + ", " + df['Contraceptive availability'] + ", " + df['Contraceptive method']
#print(df)

combination = df['categories'].unique()
print(df.head())
print(combination)

  Continent   Sub-Continent  Country  \
0    Africa  Eastern Africa  Burundi   
1    Africa  Eastern Africa  Burundi   
2    Africa  Eastern Africa  Burundi   
3    Africa  Eastern Africa  Burundi   
4    Africa  Eastern Africa  Comoros   

   Percentage distribution of women aged 15-49  (per country)  \
0                                           0.650838            
1                                           0.113709            
2                                           0.024347            
3                                           0.211106            
4                                           0.666126            

             Pregnancy intention Contraceptive availability  \
0  Not wanting to avoid pregancy             Not applicable   
1     Wanting to avoid pregnancy                   Met need   
2     Wanting to avoid pregnancy                 Unmet need   
3     Wanting to avoid pregnancy                 Unmet need   
4  Not wanting to avoid pregancy             Not appl

There are 4 categories with distributions:
1. 'Not wanting to avoid pregancy, Not applicable, Not applicable'
2. 'Wanting to avoid pregnancy, Met need, Using modern methods'
3. 'Wanting to avoid pregnancy, Unmet need, Using traditional methods'
4. 'Wanting to avoid pregnancy, Unmet need, Using no method'

This dataset also reports percentage distributions rather than counts. Would it be possible to convert the distributions to frequencies?
(I don't think this is a feasible option unless we can convert the distributions to frequencies.)
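
One possible way to get frequencies, sketched below, is to multiply each share by the number of women aged 15-49 in that country. That population figure is not in this dataset, so it would have to come from an outside source; the `1_000_000` used here is a placeholder, and `add_counts` and `estimated_count` are hypothetical names. The share column name is copied from the printed output, including its double space.

```python
import pandas as pd

SHARE_COL = 'Percentage distribution of women aged 15-49  (per country)'

def add_counts(df, women_15_49, share_col=SHARE_COL):
    """Add an estimated count per row: share * number of women aged 15-49 in that country."""
    out = df.copy()
    # Countries missing from women_15_49 get NaN instead of a count.
    out['estimated_count'] = out[share_col] * out['Country'].map(women_15_49)
    return out

# The four Burundi shares from the head() output above.
demo = pd.DataFrame({
    'Country': ['Burundi'] * 4,
    SHARE_COL: [0.650838, 0.113709, 0.024347, 0.211106],
})

# Placeholder population (hypothetical, NOT a real figure).
population = {'Burundi': 1_000_000}

counts = add_counts(demo, population)
print(counts[['Country', 'estimated_count']])
```

The resulting counts are only estimates, and any test run on them will inherit whatever error is in the population figures.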

Are there more people who want to avoid pregnancy than people who do not want to avoid pregnancy? (t-test)
Is the category dependent on country? (Chi-squared test)
Is the category dependent on sub-continent? (Chi-squared test)
Is the category dependent on continent? (Chi-squared test)
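
For the chi-squared questions, the test needs a table of counts (rows for one variable, columns for the other), not percentages. The sketch below computes the test statistic and degrees of freedom for such a table by hand, mainly to make clear what input R's `chisq.test` would expect. The table at the bottom is made up and `chi_squared_statistic` is a hypothetical helper name.

```python
def chi_squared_statistic(table):
    """Return (statistic, degrees of freedom) for a chi-squared test of independence.

    `table` is a list of rows of observed counts, e.g. one row per continent
    and one column per category.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

# Made-up 2x2 table of counts (placeholder values).
stat, dof = chi_squared_statistic([[10, 20], [20, 10]])
print(f"chi-squared = {stat:.4f}, df = {dof}")
```

The p-value is left out on purpose; it can be obtained in R from the statistic and the degrees of freedom.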