# Data Understanding

## Preparation

Import packages and set global display options


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

pd.set_option("display.width", 1000)
pd.set_option("display.max_columns", None)
pd.options.display.float_format = "{:.2f}".format

%matplotlib inline
plt.rcParams["figure.figsize"] = (20, 6)

In [None]:
gdp_df = pd.read_pickle("../data/raw/gdp.pkl")
occ_df = pd.read_pickle("../data/raw/naics_occupation.pkl")
ptn_df = pd.read_pickle("../data/raw/naics_pattern.pkl")

Based on the instructions given to us, we preemptively drop the NAICS codes that are not of interest


In [None]:
naics_filter = "|".join(["^11", "^21", "^22", "^23", "^31", "^32", "^33"])

gdp_df = gdp_df.loc[gdp_df["IndustryClassification"].str.contains(naics_filter)]
occ_df = occ_df.loc[occ_df["naics"].str.contains(naics_filter)]
ptn_df = ptn_df.loc[ptn_df["naics"].str.contains(naics_filter)]
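To illustrate how the prefix filter behaves, here is a minimal sketch on a made-up frame. The codes and the names `toy_df` and `toy_filter` are hypothetical and not taken from the datasets.

```python
import pandas as pd

# Hypothetical NAICS codes, for illustration only
toy_df = pd.DataFrame({"naics": ["1111", "2361", "3364", "4411", "5221", "3330A1"]})

# Each "^NN" anchors a two-digit sector prefix to the start of the code;
# joining with "|" yields one regex that matches any of the listed sectors.
toy_filter = "|".join(["^11", "^21", "^22", "^23", "^31", "^32", "^33"])

kept = toy_df.loc[toy_df["naics"].str.contains(toy_filter)]
print(kept["naics"].tolist())
```

Because of the `^` anchor, a code such as `5221` is dropped even though it contains `22` further in.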

## Structure

Brief overview of the different datasets given to us

### Data Format


In [None]:
gdp_df.head(10)

In [None]:
occ_df.head(10)

In [None]:
ptn_df.head(10)

### Info


In [None]:
gdp_df.info()

In [None]:
occ_df.info()

In [None]:
ptn_df.info()

## Analysis

### Top Industries


In [None]:
top_industries = pd.DataFrame()

#### GDP

By 2022 GDP


In [None]:
# highest_gdp = gdp_df.groupby(["IndustryClassification", "Description"])[["2017", "2018", "2019", "2020", "2021", "2022"]].sum().reset_index()
# top_industries["2022_gdp"] = highest_gdp.sort_values(by="2022", ascending=False).reset_index()["IndustryClassification"]

Mean GDP from 2017 to 2022


In [None]:
# highest_gdp['mean_gdp'] = highest_gdp[["2017", "2018", "2019", "2020", "2021", "2022"]].mean(axis=1)
# top_industries["mean_gdp"] = highest_gdp.sort_values(by="mean_gdp", ascending=False).reset_index()["IndustryClassification"]

#### OCC

By employment


In [None]:
highest_employment = occ_df.groupby(["naics", "NAICS_TITLE"])[
    "emp_total_county_naics"
].sum()
top_industries["employment"] = highest_employment.sort_values(
    ascending=False
).reset_index()["naics"]

#### Pattern

By annual pay


In [None]:
highest_pay = ptn_df.groupby(["naics", "DESCRIPTION"])[["ap", "qp1", "emp"]].sum()
top_industries["ap"] = highest_pay.sort_values(by="ap", ascending=False).reset_index()[
    "naics"
]

By pay per person


In [None]:
highest_pay["ap_per_emp"] = highest_pay["ap"] / highest_pay["emp"]
top_industries["ap_per_emp"] = highest_pay.sort_values(
    by="ap_per_emp", ascending=False
).reset_index()["naics"]
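The pay-per-person figure is a ratio of sums: `ap` and `emp` are first summed per industry and only then divided. A small sketch with invented numbers shows the calculation; `toy_ptn` and its values are hypothetical.

```python
import pandas as pd

# Invented rows: two industries, two records each
toy_ptn = pd.DataFrame(
    {
        "naics": ["3364", "3364", "3363", "3363"],
        "ap": [1000, 3000, 1200, 800],
        "emp": [10, 30, 40, 10],
    }
)

# Sum pay and employment per industry, then divide the totals
totals = toy_ptn.groupby("naics")[["ap", "emp"]].sum()
totals["ap_per_emp"] = totals["ap"] / totals["emp"]

ranked = totals.sort_values(by="ap_per_emp", ascending=False)
print(ranked)
```

Dividing the totals weights every employee equally, which is not the same as averaging the per-record ratios.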

By establishments


In [None]:
highest_est = ptn_df.groupby(["naics", "DESCRIPTION"])[["emp", "est"]].sum()
highest_est["emp_per_est"] = highest_est["emp"] / highest_est["est"]

top_industries["establishments"] = highest_est.sort_values(
    by="est", ascending=False
).reset_index()["naics"]

By employees per establishment


In [None]:
top_industries["emp_per_establishments"] = highest_est.sort_values(
    by="emp_per_est", ascending=False
).reset_index()["naics"]

#### Overview


In [None]:
top_industries.head(20)
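When reading this table, keep in mind that each column is an independent ranking: row 0 holds the top NAICS code for that metric, row 1 the second, and so on, so codes in the same row are not related to each other. A minimal sketch of the same construction on invented data (the `toy` frame and its values are hypothetical):

```python
import pandas as pd

# Invented metrics for three industries
toy = pd.DataFrame(
    {"emp": [500, 200, 900], "est": [10, 50, 30]},
    index=pd.Index(["3364", "3363", "3261"], name="naics"),
)

ranking = pd.DataFrame()
for metric in ["emp", "est"]:
    # Sort by the metric, then reset the index so position 0 is the top-ranked code
    ranking[metric] = toy.sort_values(by=metric, ascending=False).reset_index()["naics"]

print(ranking)
```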

**_Durable Goods_** is the most interesting branch, as Pferd-Werkzeuge makes products specifically for this industry. It also seems to have a high GDP, which makes it a better target for premium products.

The **_Construction_** branch could be interesting, as many of the products made by Pferd can also be used for woodworking etc., and it has a high number of employees. Yet its GDP doesn't seem to follow suit, so we exclude it for now.

Picked NAICS:

1. 3364 Aerospace Product and Parts Manufacturing
2. 3363 Motor Vehicle Parts Manufacturing
3. 3330A1 Machinery Manufacturing
4. 3320A2 Fabricated Metal Product Manufacturing
5. 3261 Plastics Product Manufacturing


### Top Occupations


In [None]:
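top_occupations = pd.DataFrame()

# Rebind top_industries from the ranking table to the list of picked NAICS codes
top_industries = ["3364", "3363", "3330A1", "3320A2", "3261"]

# Occupations in the first pick only (3364)
occ_df_filtered_naic1 = occ_df[occ_df["naics"] == top_industries[0]]
# Occupations across all five picks (the first pick is included here as well)
occ_df_filtered_rest = occ_df[occ_df["naics"].isin(top_industries)]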

By employees


In [None]:
top_occupations = occ_df_filtered_naic1.groupby(["OCC_CODE", "OCC_TITLE"])[
    "emp_occupation"
].sum()
top_occupations = top_occupations.sort_values(ascending=False)

top_occupations.head(20)

In [None]:
top_occupations = occ_df_filtered_rest.groupby(["OCC_CODE", "OCC_TITLE"])[
    "emp_occupation"
].sum()
top_occupations = top_occupations.sort_values(ascending=False)

top_occupations.head(20)
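The two cells above differ only in the industry filter: `==` keeps a single NAICS code, while `isin` keeps every picked code (including the first). A small sketch on invented rows, where `toy_occ`, `picked`, and the employment numbers are hypothetical:

```python
import pandas as pd

# Invented occupation rows for three industries
toy_occ = pd.DataFrame(
    {
        "naics": ["3364", "3364", "3363", "2361"],
        "OCC_CODE": ["51-4121", "17-2112", "51-4121", "47-2031"],
        "emp_occupation": [120, 80, 200, 300],
    }
)
picked = ["3364", "3363"]

first_only = toy_occ[toy_occ["naics"] == picked[0]]   # one industry
all_picked = toy_occ[toy_occ["naics"].isin(picked)]   # every picked industry

# Total employment per occupation across the picked industries, largest first
by_occ = (
    all_picked.groupby("OCC_CODE")["emp_occupation"].sum().sort_values(ascending=False)
)
print(by_occ)
```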

Occupations that pique our interest are the ones that would directly use our products (premium tools for metalworking):

1. 51-4072 Molding, Coremaking, and Casting Machine Setters, Operators, and Tenders, Metal and Plastic
2. 51-4121 Welders, Cutters, Solderers, and Brazers
3. 51-4031 Cutting, Punching, and Press Machine Setters, Operators, and Tenders, Metal and Plastic
4. 51-4081 Multiple Machine Tool Setters, Operators, and Tenders, Metal and Plastic
5. 17-2112 Industrial Engineers


In [None]:
top_occupations = ["51-4072", "51-4121", "51-4031", "51-4081", "17-2112"]

## Export


In [None]:
top_picks = pd.DataFrame()

top_picks["naics"] = top_industries
top_picks["occ"] = top_occupations

pd.to_pickle(top_picks, "../data/processed/top_picks.pkl")
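As a quick sanity check, the exported frame can be read back and compared with the original. The sketch below uses a temporary directory and a shortened, hypothetical `picks` frame rather than the real output path. Note that the `naics` and `occ` columns are only paired by row position; a row does not mean that the occupation belongs to that industry.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Shortened stand-in for top_picks, for illustration only
picks = pd.DataFrame({"naics": ["3364", "3363"], "occ": ["51-4072", "51-4121"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "top_picks.pkl"
    picks.to_pickle(path)             # write the frame
    restored = pd.read_pickle(path)   # read it back

# True when values, dtypes, and shape survived the round trip
print(restored.equals(picks))
```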