**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update/change the relevant sections where you lost those points, make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - Data Checkpoint

# Names

- Mihir Joshi
- Kanishk Hari
- Karina Shah
- Arpita Pandey
- Sahithi Josyam

# Research Question

-  Include a specific, clear data science question.
-  Make sure what you're measuring (variables) to answer the question is clear

What is your research question? Include the specific question you're setting out to answer. This question should be specific, answerable with data, and clear. A general question with specific subquestions is permitted. (1-2 sentences)



What is the relationship between unemployment rates of specific job titles and the adoption of AI (integration of AI-driven automation, decision-making, and augmentation tools within industry sectors) across industries in the U.S. from 2010 to 2023?

## Background and Prior Work


The rise of generative AI tools like ChatGPT, Gemini, and GitHub Copilot has caught global attention through their advances in machine learning.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1)  With their rapid growth, there have been discussions about how to use AI in various industries.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2)  Through our research, we hope to delve into how artificial intelligence has affected employment rates in these industries over the past few years.


Generative AI has shown potential for immense economic growth. A report by McKinsey & Company estimated that generative AI could add 2.6 trillion to 4.4 trillion dollars annually to the global economy.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) However, this growth isn't uniform across industries. The report finds that about 75 percent of that value falls across 4 major areas: customer operations, marketing and sales, software engineering, and R&D.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) Therefore, it is important to further examine the effects of AI in various industries, as well as the job opportunities that come with it.

With the innovations of generative AI, there are growing concerns about its effect on job displacement.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) A recent study by Goldman Sachs states that generative AI could automate up to 300 million jobs in the US and Europe.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) Because AI can automate certain tasks and everyday cognitive processes, there are concerns that it will take over various vocations.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) Another study showed that after the rise of generative AI tools, weekly job postings for automation-prone jobs fell 21% relative to more manual-intensive jobs.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) These studies suggest that many jobs could be taken over by generative AI, but we hope to delve further into which industries and job titles seem most at risk of displacement, and which industries still have potential for job growth.




1. <a name="cite_note-1"></a> [^](#cite_ref-1) Chui, Michael, et al. The Economic Potential of Generative AI, June 2023, [https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/the-economic-potential-of-generative-ai-the-next-productivity-frontier#introduction]
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Rege, Manjeet, and Hemachandran K. “Tommie Experts: Generative AI’s Real-World Impact on Job Markets - Newsroom: University of St. Thomas.” Newsroom | University of St. Thomas, 4 Nov. 2024, [https://news.stthomas.edu/generative-ais-real-world-impact-on-job-markets/]
3. <a name="cite_note-3"></a> [^](#cite_ref-3) Demirci, Ozge, Jonas Hannane, and Xinrong Zhu. “Research: How Gen AI Is Already Impacting the Labor Market.” Harvard Business Review, 11 Nov. 2024, [https://hbr.org/2024/11/research-how-gen-ai-is-already-impacting-the-labor-market]


# Hypothesis


We predict that higher rates of AI adoption—defined as the integration of AI-driven automation, decision-making, and augmentation tools within industry sectors—will be associated with variations in employment patterns. Specifically, industries with higher AI adoption may experience shifts in workforce demand, where some traditional roles decline while new AI-related opportunities emerge. The extent to which AI adoption impacts employment will likely depend on industry investment in retraining and upskilling programs, with sectors that prioritize workforce adaptation experiencing more stable employment outcomes over time.

# Data

## Data overview

For each dataset include the following information
- Dataset #1
  - Dataset Name:
  - Link to the dataset:
  - Number of observations:
  - Number of variables:
- Dataset #2 (if you have more than one!)
  - Dataset Name:
  - Link to the dataset:
  - Number of observations:
  - Number of variables:
- etc

Now write 2 - 5 sentences describing each dataset here. Include a short description of the important variables in the dataset; what the metrics and datatypes are, what concepts they may be proxies for. Include information about how you would need to wrangle/clean/preprocess the dataset

If you plan to use multiple datasets, add a few sentences about how you plan to combine these datasets.

## Dataset #1 - AI Job Data

Dataset Name: AI Job Data

Dataset Link: https://www.kaggle.com/datasets/manavgupta92/from-data-entry-to-ceo-the-ai-job-threat-index

There are 4706 observations.

There are 6 variables in the data: Job titiles, AI Impact, Tasks, AI models, AI_Workload_Ratio, and Domain.

The AI Job Data set contains each unique job title, the AI impact on the role as a percentage, the number of tasks performed in the role, the number of AI models capable of completing some of those tasks, the AI workload ratio for the role, and the domain the role belongs to. We want to extract additional features, such as the field of work and the role's level within the organization, which we can infer from the domain and the job title. To clean the data, we need to strip the percent signs from AI Impact so it can be treated as a number, and we will likely group the domains and roles into more digestible buckets.

Job titiles:
Names of various job roles spanning different industries.


AI Impact:
Percentage representation of AI's influence on the respective job title.


Tasks:
Numerical count of human-performed tasks associated with each job title.


AI models:
Count of AI models or systems implemented or associated with the job role.


AI_Workload_Ratio:
A computed ratio representing the workload distribution between tasks and AI models. In the rows shown in this report, it equals Tasks divided by AI models.


Domain:
The broader category or industry to which the job title belongs.

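In [4]:
import pandas as pd

# Load the Kaggle AI job dataset
df = pd.read_csv("AI_Job_Data.csv")

In [5]:
# Flag titles containing "manager" or "director" (case-insensitive; missing titles count as 0)
df['is_manager_or_director'] = df['Job titiles'].str.contains('manager|director', case=False, na=False).astype(int)
# Convert AI Impact from a string such as "98%" to a float
df['AI Impact'] = df['AI Impact'].str.rstrip('%').astype(float)
df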

| | Job titiles | AI Impact | Tasks | AI models | AI_Workload_Ratio | Domain | is_manager_or_director |
|---|---|---|---|---|---|---|---|
| 0 | Communications Manager | 98.0 | 365 | 2546 | 0.143362 | Communication & PR | 1 |
| 1 | Data Collector | 95.0 | 299 | 2148 | 0.139199 | Data & IT | 0 |
| 2 | Data Entry | 95.0 | 325 | 2278 | 0.142669 | Administrative & Clerical | 0 |
| 3 | Mail Clerk | 95.0 | 193 | 1366 | 0.141288 | Leadership & Strategy | 0 |
| 4 | Compliance Officer | 92.0 | 194 | 1369 | 0.141709 | Medical & Healthcare | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 4701 | Singer | 5.0 | 686 | 2798 | 0.245175 | Data & IT | 0 |
| 4702 | Airport | 5.0 | 556 | 2206 | 0.252040 | Administrative & Clerical | 0 |
| 4703 | Director | 5.0 | 1316 | 4695 | 0.280298 | Leadership & Strategy | 1 |
| 4704 | Nurse | 5.0 | 710 | 2594 | 0.273709 | Medical & Healthcare | 0 |
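
As a rough sketch of the feature extraction described earlier (inferring a role's level in the organization from its job title), the snippet below assigns each title to a seniority bucket by keyword. The `SENIORITY_KEYWORDS` mapping, the bucket names, and the `seniority_level` helper are illustrative assumptions, not part of the dataset, and the toy rows stand in for the real file.

```python
import re

import pandas as pd

# Hypothetical keyword-to-level mapping; these buckets are an assumption
# made for illustration and are not defined anywhere in the dataset.
SENIORITY_KEYWORDS = {
    "executive": {"chief", "ceo", "president"},
    "management": {"director", "manager", "head"},
    "entry": {"intern", "assistant", "clerk", "trainee"},
}


def seniority_level(title: str) -> str:
    """Return the first bucket whose keyword appears as a whole word in the title."""
    words = set(re.findall(r"[a-z]+", title.lower()))
    for level, keywords in SENIORITY_KEYWORDS.items():
        if words & keywords:
            return level
    return "individual contributor"


# Toy rows standing in for the real file; "Job titiles" is the dataset's own column name.
sample = pd.DataFrame(
    {"Job titiles": ["Communications Manager", "Mail Clerk", "Director", "Nurse"]}
)
sample["seniority"] = sample["Job titiles"].apply(seniority_level)
print(sample)
```

Matching on whole words rather than substrings avoids accidental hits inside longer words, but any keyword list like this would need to be checked against the real titles before use.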


In [6]:
df.describe()



| | AI Impact | Tasks | AI models | AI_Workload_Ratio | is_manager_or_director |
|---|---|---|---|---|---|
| count | 4706.0 | 4706.0 | 4706.0 | 4706.0 | 4706.0 |
| mean | 30.31258 | 400.708032 | 1817.678071 | inf | 0.123884 |
| std | 18.203777 | 311.564781 | 1086.853037 | NaN | 0.329485 |
| min | 5.0 | 1.0 | 0.0 | 0.036585 | 0.0 |
| 25% | 15.0 | 161.0 | 1085.25 | 0.137271 | 0.0 |
| 50% | 25.0 | 270.0 | 1577.5 | 0.199281 | 0.0 |
| 75% | 40.0 | 608.75 | 2273.0 | 0.260572 | 0.0 |
| max | 98.0 | 1387.0 | 5666.0 | inf | 1.0 |
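
The `inf` mean and missing standard deviation for AI_Workload_Ratio in the summary above are what pandas reports when a column contains infinite values. The first sample rows are consistent with the ratio being Tasks divided by AI models (365 / 2546 ≈ 0.143362), and the minimum of AI models is 0, so a division by zero is a plausible cause. This is an inference from the displayed output, not something the dataset documents. A minimal sketch of one way to handle it, using toy rows rather than the real file:

```python
import numpy as np
import pandas as pd

# Toy rows; the last one mimics a job title with zero associated AI models.
toy = pd.DataFrame(
    {
        "Tasks": [365, 299, 10],
        "AI models": [2546, 2148, 0],
    }
)

# Assumed definition of the ratio (inferred from the sample rows, not documented).
toy["AI_Workload_Ratio"] = toy["Tasks"] / toy["AI models"]  # 10 / 0 becomes inf

# Replace infinite ratios with NaN so summary statistics skip them.
toy["AI_Workload_Ratio"] = toy["AI_Workload_Ratio"].replace([np.inf, -np.inf], np.nan)

print(toy["AI_Workload_Ratio"].describe())
```

Treating these rows as missing keeps them in the table for other analyses. Whether to drop them entirely instead is a judgment call that depends on how many titles are affected.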


## Dataset #2 - Unemployment Rate Per Industry

Dataset Name: Unemployed persons by industry and class of worker

Dataset Link: https://www.bls.gov/webapps/legacy/cpsatab14.htm#

The BLS table linked above provides the data we need. To build a dataset of unemployment rates per industry, we selected every industry listed and retrieved monthly unemployment rates from 2010 to 2023. The data for each industry came in a separate Excel spreadsheet, so we combined the spreadsheets into a single sheet and averaged the monthly rates within each year for each industry.

We will use the resulting IndustryUnemployment.csv file, which contains the yearly averages per industry, for further analysis. It has 14 observations (one per year) and 12 variables (Year plus 11 industry columns).

The IndustryUnemployment dataset covers the following industries: Nonagriculture Industries, Manufacturing, Wholesale and Retail Trade, Transportation and Utilities, Information, Financial Activities, Professional and Business Services, Education and Health Services, Leisure and Hospitality, Other Services, and Agriculture and Related Industry.
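
The yearly averaging described above could be sketched roughly as follows. The long-format layout, the column names (`Industry`, `Year`, `Month`, `Rate`), and the values are made up for illustration; the actual BLS spreadsheets may be structured differently.

```python
import pandas as pd

# Hypothetical monthly rates in long form (values are invented for the example).
monthly = pd.DataFrame(
    {
        "Industry": ["Manufacturing"] * 4 + ["Information"] * 4,
        "Year": [2010, 2010, 2011, 2011] * 2,
        "Month": [1, 2, 1, 2] * 2,
        "Rate": [10.0, 11.0, 9.0, 8.0, 9.5, 10.5, 7.0, 8.0],
    }
)

# Average the monthly rates within each year, then spread industries into columns
# so the result has one row per year, like IndustryUnemployment.csv.
yearly = (
    monthly.groupby(["Year", "Industry"])["Rate"]
    .mean()
    .unstack("Industry")
    .round(1)
    .reset_index()
)
yearly.columns.name = None  # drop the leftover "Industry" label on the column index

print(yearly)
```

This is an unweighted mean of the monthly rates, which matches the description of taking yearly averages but is not the same as an official BLS annual figure.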

In [8]:
#unemployment rate per industry dataframe: df_2
df_2 = pd.read_csv("IndustryUnemployment.csv")
df_2

| | Year | Nonagriculture Industries | Manufacturing | Wholesale and Retail Trade | Transportation and Utilities | Information | Financial Activities | Professional and Business Services | Education and Health Services | Leisure and Hospitality | Other Services | Agriculture and Related Industry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010 | 9.9 | 10.2 | 9.5 | 7.9 | 9.7 | 6.9 | 10.9 | 5.8 | 12.2 | 8.6 | 13.9 |
| 1 | 2011 | 9.0 | 8.8 | 9.0 | 8.1 | 7.3 | 6.4 | 9.6 | 5.6 | 11.6 | 8.8 | 12.7 |
| 2 | 2012 | 7.9 | 7.1 | 8.1 | 6.9 | 7.6 | 5.1 | 8.9 | 5.6 | 10.4 | 7.2 | 12.6 |
| 3 | 2013 | 7.2 | 6.4 | 7.3 | 6.4 | 6.2 | 4.5 | 8.3 | 4.9 | 10.0 | 6.9 | 10.2 |
| 4 | 2014 | 5.9 | 4.7 | 6.1 | 5.5 | 5.2 | 4.0 | 6.9 | 4.2 | 8.6 | 5.7 | 9.6 |
| 5 | 2015 | 5.1 | 4.2 | 5.5 | 4.2 | 3.9 | 2.6 | 5.6 | 3.6 | 7.9 | 5.2 | 9.5 |
| 6 | 2016 | 4.7 | 4.2 | 5.0 | 4.1 | 4.6 | 2.7 | 5.1 | 3.3 | 6.8 | 4.4 | 8.5 |
| 7 | 2017 | 4.2 | 3.5 | 4.6 | 3.9 | 4.5 | 2.4 | 4.6 | 3.0 | 6.1 | 3.8 | 7.2 |
| 8 | 2018 | 3.8 | 3.3 | 4.4 | 3.3 | 3.7 | 2.2 | 3.9 | 2.7 | 5.7 | 3.4 | 7.3 |
| 9 | 2019 | 3.5 | 2.9 | 4.1 | 3.4 | 3.5 | 2.1 | 3.6 | 2.5 | 5.2 | 3.2 | 7.4 |


In [9]:
df_2.describe()

| | Year | Nonagriculture Industries | Manufacturing | Wholesale and Retail Trade | Transportation and Utilities | Information | Financial Activities | Professional and Business Services | Education and Health Services | Leisure and Hospitality | Other Services | Agriculture and Related Industry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 14.0 | 14.0 | 14.0 | 14.0 | 14.0 | 14.0 | 14.0 | 14.0 | 14.0 | 14.0 | 14.0 | 14.0 |
| mean | 2016.5 | 5.871429 | 5.192857 | 6.157143 | 5.592857 | 5.378571 | 3.557143 | 6.178571 | 3.964286 | 8.971429 | 5.621429 | 8.971429 |
| std | 4.1833 | 2.210266 | 2.377285 | 1.941734 | 2.149942 | 2.015776 | 1.636466 | 2.430631 | 1.305967 | 3.931613 | 2.319874 | 2.600718 |
| min | 2010.0 | 3.5 | 2.9 | 4.1 | 3.3 | 3.0 | 1.9 | 3.5 | 2.4 | 5.2 | 3.1 | 5.6 |
| 25% | 2013.25 | 3.9 | 3.35 | 4.45 | 4.025 | 3.75 | 2.25 | 4.075 | 2.775 | 5.875 | 3.5 | 7.325 |
| 50% | 2016.5 | 5.25 | 4.2 | 5.7 | 4.85 | 4.9 | 2.8 | 5.4 | 3.55 | 8.25 | 5.3 | 8.2 |
| 75% | 2019.75 | 7.725 | 6.925 | 7.9 | 6.775 | 7.025 | 4.375 | 7.95 | 5.425 | 10.325 | 7.125 | 10.05 |
| max | 2023.0 | 9.9 | 10.2 | 9.5 | 10.5 | 9.7 | 6.9 | 10.9 | 5.8 | 19.8 | 9.9 | 13.9 |


# Ethics & Privacy

- Thoughtful discussion of ethical concerns included
- Ethical concerns consider the whole data science process (question asked, data collected, data being used, the bias in data, analysis, post-analysis, etc.)
- How your group handled bias/ethical concerns clearly described

Acknowledge and address any ethics & privacy related issues of your question(s), proposed dataset(s), and/or analyses. Use the information provided in lecture to guide your group discussion and thinking. If you need further guidance, check out [Deon's Ethics Checklist](http://deon.drivendata.org/#data-science-ethics-checklist). In particular:

- Are there any biases/privacy/terms of use issues with the data you proposed?
- Are there potential biases in your dataset(s), in terms of who it composes, and how it was collected, that may be problematic in terms of it allowing for equitable analysis? (For example, does your data exclude particular populations, or is it likely to reflect particular human biases in a way that could be a problem?)
- How will you set out to detect these specific biases before, during, and after/when communicating your analysis?
- Are there any other issues related to your topic area, data, and/or analyses that are potentially problematic in terms of data privacy and equitable impact?
- How will you handle issues you identified?

Our first dataset, "From Data Entry to CEO: The AI Job Threat Index", includes no personal data, so it raises no privacy concerns. Our second dataset, "US Monthly Unemployment Rate 1948 - Present", also includes no personal data, as it is just a survey of unemployment rates per month, so it raises no privacy concerns either.

However, bias concerns exist within the first dataset. The owner of the dataset scraped the data from multiple "reputable" sources but did not name them. As a result, it is difficult to pinpoint where the data came from, and this is something we will have to closely monitor and address if we find the original sources. Moreover, our second dataset has data only until 2019, so we would need to find another source that gives us the unemployment rate per month from 2020 to the present.

Since our data mostly concerns how efficiently AI performs certain jobs, numeric data will be the primary focus of this project. Numeric data of this kind is commonly forged to push a particular bias about job displacement due to artificial intelligence. To handle these issues, we will cross-reference the dataset against information online (ideally we would find the original data, but otherwise we will gather information from multiple sources to verify it). Beyond this, we see no other issues with our topic area, data, or analyses regarding data privacy and equitable impact, as the data merely shows how impacted certain roles were in correspondence with unemployment rates.

# Team Expectations 


After reading COGS 108 Team Policies, we have set the following as our team's expectations: 

* Communication: Communication among team members will take place virtually on Discord and in person as needed. We will meet weekly, or as needed, on Thursdays at 6:30. To ensure that all viewpoints are heard and valued, our team will be "blunt but polite". Additionally, decisions will be made by majority vote.
* Responsibility: All work will be split equally. That said, work assigned to one individual is still open to collaboration so that every team member is able to contribute (e.g., no one person will do all the coding work). If a team member needs assistance with a task, the others will not hesitate to help. Tasks will be divided according to each individual's comfort with them, but can be rotated weekly so that each team member can try something new.
* Team members will stay as closely aligned with the plan as possible. If anybody foresees an issue arising, they will notify the others as soon as possible so that everyone can prepare.

# Project Timeline Proposal

Team's anticipated timeline for the project: 

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/28  |  5 PM | Read & Think about COGS 108 expectations; brainstorm topics/questions  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis for the final topic and begin the initial background research | 
| 2/4  |  5 PM |  Do background research on topic | Discuss ideal dataset(s) fitting for our topic; start the initial project proposal draft| 
| 2/9  | 5 PM | Finalize the project proposal; Search for datasets  | Submit the proposal; Begin discussing wrangling and possible approaches for data analysis; Assign group members different parts   |
| 2/13  | 6:30 PM  | Finalize datasets; Import & Wrangle Data (Mihir Joshi & Sahithi Josyam); Begin EDA (Sahithi Josyam & Mihir Joshi); Review Project Proposal Feedback and make corrections | Review/Edit wrangling/EDA; Discuss potential concerns with datasets collected; Discuss plan for analysis   |
| 2/23  | 4:30 PM  | Finalize wrangling/EDA; Begin Analysis (Arpita Pandey; Karina Shah); Submit Data Checkpoint | Refine the current analysis for topic and complete project check-in! |
| 3/13  | 12 PM  | Complete analysis; Draft results/conclusion/discussion (Kanishk Hari)| Discuss/edit full project |
| 3/20  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |