# Time Series Analysis of NAICS

<img src="img.svg.png">

The North American Industry Classification System (NAICS) classifies business establishments by type of economic activity.
It is used by governments and businesses in Canada, Mexico, and the United States of America.

It has largely replaced the older Standard Industrial Classification (SIC) system, except in some government agencies, such as the U.S. Securities and Exchange Commission (SEC).

An establishment is typically a single physical location, though administratively distinct operations at a single location may be treated as distinct establishments. 
Each establishment is classified to an industry according to the primary business activity taking place there. NAICS does not offer guidance on the classification of enterprises (companies) which are composed of multiple establishments.

NAICS is designed to provide common definitions of the
industrial structure of the three countries and a common statistical framework to facilitate the
analysis of the three economies.

## The data provided contains:

<b>a- Raw data:</b><br>
15 CSV files beginning with RTRA.<br>
These files contain employment data by industry at three levels of aggregation: 2-digit NAICS, 3-digit NAICS, and 4-digit
NAICS. <br>
The columns are as follows:
<ul>
    <li>
 SYEAR: Survey Year
        </li>
    <li>
 SMTH: Survey Month
        </li>
    <li>
 NAICS: Industry name followed by the associated NAICS code in square brackets
        </li>
    <li>
 _EMPLOYMENT_: Employment
        </li>
</ul>
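
As a rough illustration of these columns, the sketch below reads two rows (copied from the 2-digit output shown later in this notebook) from an in-memory CSV and derives a monthly timestamp from `SYEAR` and `SMTH`. The `date` column name is an assumption made for this example; it is not part of the RTRA files.

```python
import io

import pandas as pd

# Two rows copied from the 2-digit RTRA output shown later in this notebook.
sample_csv = io.StringIO(
    "SYEAR,SMTH,NAICS,_EMPLOYMENT_\n"
    "2000,1,Accommodation and food services [72],148000\n"
    "2000,1,Construction [23],106250\n"
)
sample = pd.read_csv(sample_csv)

# Build a first-of-month timestamp from the survey year and month.
# "date" is a hypothetical helper column, not an RTRA field.
parts = sample[["SYEAR", "SMTH"]].rename(columns={"SYEAR": "year", "SMTH": "month"})
sample["date"] = pd.to_datetime(parts.assign(day=1))

print(sample[["date", "NAICS", "_EMPLOYMENT_"]])
```

A single datetime column like this tends to make later plotting and resampling easier than carrying year and month separately.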

<b>b- LMO Detailed Industries by NAICS:</b><br> 
An Excel file that maps the RTRA data to the desired output. <br>
The first column lists 59 frequently used industries.<br>
The second column gives their NAICS definitions. <br>
Using these NAICS definitions and the RTRA data, we create a monthly employment data series from 1997 to 2018 for these 59
industries.


<b>c- Data Output Template:</b><br>
An Excel file with an empty column for employment.
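
To get a feel for the size of the target series, here is a minimal sketch of an empty year-by-month-by-industry grid for 1997 to 2018. The real template's column names are not shown in this notebook, so the names below (including `Employment`) are assumptions, and the two industries are placeholders for the full list of 59.

```python
import pandas as pd

# Placeholder names; the real list comes from the first column of the LMO file.
industries = ["Farms", "Construction"]

# One row per (year, month, industry) combination from 1997 through 2018.
grid = pd.MultiIndex.from_product(
    [range(1997, 2019), range(1, 13), industries],
    names=["SYEAR", "SMTH", "LMO_Detailed_Industry"],
).to_frame(index=False)

# The employment column stays empty until the RTRA data has been aggregated.
grid["Employment"] = float("nan")

print(len(grid))  # 22 years * 12 months * 2 industries = 528
```

With all 59 industries the same grid would have 22 * 12 * 59 = 15,576 rows.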

## Task

In this task, we need to understand how NAICS works as a hierarchical structure that defines industries at different levels of aggregation. <br>

<b>For example:</b> <br>
In NAICS 2017 – Statistics Canada.pdf (see page 22), a 2-digit NAICS industry (e.g., 23 - Construction) is
composed of some 3-digit NAICS industries (236 - Construction of buildings, 237 - Heavy
and civil engineering construction, and a few more 3-digit NAICS industries).<br>

Similarly, a 3-digit NAICS industry (e.g., 236 - Construction of buildings) is composed of
4-digit NAICS industries (2361 - Residential building construction and 2362 - Non-residential building construction).
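
The hierarchy described above can be sketched with plain string prefixes: a 4-digit code belongs to the 3-digit industry formed by its first three digits, and to the 2-digit sector formed by its first two. The helper names below are hypothetical and not part of the original notebook. Sectors written as ranges (such as 44-45) would need extra handling that this sketch leaves out.

```python
def naics_ancestors(code):
    """Return the coarser NAICS levels that contain `code`, broadest first."""
    code = str(code)
    # A 4-digit code such as "2361" sits under "236", which sits under "23".
    return [code[:n] for n in range(2, len(code))]


def is_within(code, parent):
    """True when `code` equals `parent` or is one of its sub-industries."""
    return str(code).startswith(str(parent))


print(naics_ancestors("2361"))   # ['23', '236']
print(is_within("2362", "236"))  # True
```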

## Get and prepare the dataset:

### a- Loading and exploring the LMO_Detailed_Industries_by_NAICS data:

In [15]:
# import libraries
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as snb
import matplotlib.pyplot as plt

In [16]:
# Load the LMO_Detailed_Industries_by_NAICS data
# (a forward slash avoids the invalid "\L" escape and also works outside Windows)
LMO_Detailed_Industries_df = pd.read_excel("Data/LMO_Detailed_Industries_by_NAICS.xlsx")
LMO_Detailed_Industries_df.head()

Unnamed: 0,LMO_Detailed_Industry,NAICS
0,Farms,111 & 112
1,"Fishing, hunting and trapping",114
2,Forestry and logging,113
3,Support activities for agriculture and forestry,115
4,Oil and gas extraction,211


In [17]:
# Create a list of NAICS for industries
list_NAICS = LMO_Detailed_Industries_df['NAICS'].astype(str).str.replace(' &', ',').str.split(', ')

In [18]:
list_NAICS.head()

0    [111, 112]
1         [114]
2         [113]
3         [115]
4         [211]
Name: NAICS, dtype: object

In [19]:
LMO_Detailed_Industries_df["list_NAICS"] = list_NAICS

In [20]:
LMO_Detailed_Industries_df.head()

Unnamed: 0,LMO_Detailed_Industry,NAICS,list_NAICS
0,Farms,111 & 112,"[111, 112]"
1,"Fishing, hunting and trapping",114,[114]
2,Forestry and logging,113,[113]
3,Support activities for agriculture and forestry,115,[115]
4,Oil and gas extraction,211,[211]
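
The replace-then-split approach above depends on the exact separators (`" & "` and `", "`) used in the Excel file. As a hedged alternative, the sketch below pulls out every run of digits with `Series.str.findall`, which tolerates inconsistent spacing. Whether the real file contains other notations is an assumption to check. In particular, a hyphenated range such as `44-45` would be split into its two endpoints by this pattern.

```python
import pandas as pd

# Values taken from the head of the LMO file shown above.
naics_defs = pd.Series(["111 & 112", "114", "113"], name="NAICS")

# Collect every run of digits, regardless of the separator around it.
parsed = naics_defs.astype(str).str.findall(r"\d+")

print(parsed.tolist())  # [['111', '112'], ['114'], ['113']]
```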


### b- Loading and exploring the 2-Digit NAICS Industries data:

In [21]:
# All the csv files have the same columns

In [22]:
# Get the data of 2digit NAICS industries
df_2_NAICS = pd.concat(map(pd.read_csv, ['Data/RTRA_Employ_2NAICS_00_05.csv', 'Data/RTRA_Employ_2NAICS_06_10.csv',
                                         'Data/RTRA_Employ_2NAICS_11_15.csv', 'Data/RTRA_Employ_2NAICS_16_20.csv',
                                          'Data/RTRA_Employ_2NAICS_97_99.csv']))

In [23]:
df_2_NAICS.head()

Unnamed: 0,SYEAR,SMTH,NAICS,_EMPLOYMENT_
0,2000,1,Accommodation and food services [72],148000
1,2000,1,"Administrative and support, waste management a...",59250
2,2000,1,"Agriculture, forestry, fishing and hunting [11]",61750
3,2000,1,"Arts, entertainment and recreation [71]",39500
4,2000,1,Construction [23],106250


In [24]:
df_2_NAICS.shape

(5472, 4)
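
The same loading pattern is repeated for each aggregation level, so it could be wrapped in a small helper. The sketch below is one possible version, not the notebook's own code: `load_rtra` is a hypothetical name, and the glob pattern assumes every file follows the `RTRA_Employ_<level>NAICS_<years>.csv` naming seen above. Passing `ignore_index=True` gives the combined frame a unique 0..n-1 index, which avoids alignment surprises in later column assignments.

```python
from pathlib import Path

import pandas as pd


def load_rtra(data_dir, level):
    """Concatenate all RTRA employment files for one NAICS level (2, 3 or 4)."""
    # Assumed naming scheme, based on the file names used in this notebook.
    paths = sorted(Path(data_dir).glob(f"RTRA_Employ_{level}NAICS_*.csv"))
    frames = [pd.read_csv(path) for path in paths]
    # ignore_index=True renumbers the rows so every index label is unique.
    return pd.concat(frames, ignore_index=True)
```

A call such as `load_rtra("Data", 2)` would then stand in for the explicit list of five file names.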

### Separate the Industry description and NAICS code 

In [26]:
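# Separate the industry description and the NAICS code
# (code2 is rebuilt from a plain list, so it gets a fresh 0..n-1 index)
code2 = pd.DataFrame(df_2_NAICS['NAICS'].str.split('[').tolist(), columns=["NAICS","code"])
# regex=False treats ']' as a literal character in every pandas version
code2 = pd.DataFrame(code2['code'].str.replace(']', '', regex=False))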

In [27]:
code2

Unnamed: 0,code
0,72
1,56
2,11
3,71
4,23
...,...
5467,53
5468,44-45
5469,48-49
5470,22


In [28]:
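# Assign by position: df_2_NAICS keeps the per-file index after pd.concat (duplicate labels),
# while code2 is numbered 0..n-1, so an index-aligned assignment can attach the wrong codes
df_2_NAICS["code"] = code2["code"].to_numpy()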

In [29]:
df_2_NAICS

Unnamed: 0,SYEAR,SMTH,NAICS,_EMPLOYMENT_,code
0,2000,1,Accommodation and food services [72],148000,72
1,2000,1,"Administrative and support, waste management a...",59250,56
2,2000,1,"Agriculture, forestry, fishing and hunting [11]",61750,11
3,2000,1,"Arts, entertainment and recreation [71]",39500,71
4,2000,1,Construction [23],106250,23
...,...,...,...,...,...
715,1999,12,"Real estate, rental and leasing [53]",37000,53
716,1999,12,Retail trade [44-45],230750,44-45
717,1999,12,Transportation and warehousing [48-49],117500,48-49
718,1999,12,Utilities [22],10250,22
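
A possible alternative to the split-and-replace step is `Series.str.extract`, which returns a Series on the same index as its input, so no separate frame has to be realigned afterwards. The sketch below uses three industry labels from this notebook and deliberately gives them duplicate index labels to mimic a frame concatenated without `ignore_index`.

```python
import pandas as pd

names = pd.Series(
    ["Accommodation and food services [72]", "Retail trade [44-45]", "Utilities[221]"],
    index=[0, 716, 0],  # duplicate labels, as after pd.concat without ignore_index
)

# Capture whatever sits between the square brackets; the index is preserved.
codes = names.str.extract(r"\[([^\]]+)\]", expand=False)

print(codes.tolist())  # ['72', '44-45', '221']
```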


In [30]:
# Return the 'LMO_Detailed_Industry' name for the NAICS code of one RTRA row,
# or NaN when the code is not part of any LMO industry definition
def add_lmo_industry(row):
    matches = LMO_Detailed_Industries_df.apply(lambda y: y["LMO_Detailed_Industry"] 
                                                if (row['code'] in y['list_NAICS']) else np.nan, axis=1)
    matches = matches.dropna()
    if matches.empty:
        return np.nan
    # to_string can pad the value with spaces, so strip it
    return matches.to_string(index=False).strip()

In [31]:
# Get the LMO_Detailed_Industry for the 2digit NAICS RTRA file
df_2_NAICS["LMO_Detailed_Industry"] = df_2_NAICS.apply(add_lmo_industry, axis=1)
df_2_NAICS.head()

Unnamed: 0,SYEAR,SMTH,NAICS,_EMPLOYMENT_,code,LMO_Detailed_Industry
0,2000,1,Accommodation and food services [72],148000,72,
1,2000,1,"Administrative and support, waste management a...",59250,56,"Business, building and other support services"
2,2000,1,"Agriculture, forestry, fishing and hunting [11]",61750,11,
3,2000,1,"Arts, entertainment and recreation [71]",39500,71,
4,2000,1,Construction [23],106250,23,Construction
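
Applying a function row by row scans the whole LMO table once per RTRA row, which can be slow on larger files. One vectorized option, sketched here under the assumption that each NAICS code belongs to at most one LMO industry, is to explode `list_NAICS` into one row per code and left-merge on `code`. The small frames below are stand-ins built from values shown in this notebook.

```python
import pandas as pd

# Stand-in for LMO_Detailed_Industries_df with its list_NAICS column.
lmo = pd.DataFrame({
    "LMO_Detailed_Industry": ["Farms", "Construction"],
    "list_NAICS": [["111", "112"], ["23"]],
})

# Stand-in for an RTRA frame that already has a string "code" column.
rtra = pd.DataFrame({"code": ["23", "72"], "_EMPLOYMENT_": [106250, 148000]})

# One row per (industry, code) pair, then a left join keeps unmatched RTRA rows as NaN.
lookup = lmo.explode("list_NAICS").rename(columns={"list_NAICS": "code"})
mapped = rtra.merge(lookup, on="code", how="left")

print(mapped)
```

Codes without an LMO definition (here `72`) come back with a missing `LMO_Detailed_Industry`, matching the blanks produced by the row-wise function.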


### c- Loading and exploring the 3-Digit NAICS Industries data:

In [32]:
# Get the data of 3digit NAICS industries
df_3_NAICS = pd.concat(map(pd.read_csv, ['Data/RTRA_Employ_3NAICS_00_05.csv','Data/RTRA_Employ_3NAICS_06_10.csv', 
                                         'Data/RTRA_Employ_3NAICS_11_15.csv','Data/RTRA_Employ_3NAICS_16_20.csv',
                                         'Data/RTRA_Employ_3NAICS_97_99.csv']))

In [33]:
df_3_NAICS.head()

Unnamed: 0,SYEAR,SMTH,NAICS,_EMPLOYMENT_
0,2000,1,Aboriginal public administration[914],500
1,2000,1,Accommodation services[721],33750
2,2000,1,Administrative and support services[561],55250
3,2000,1,Air transportation[481],17500
4,2000,1,Ambulatory health care services[621],53000


In [34]:
# Separate the industry description and the NAICS code
code3 = pd.DataFrame(df_3_NAICS['NAICS'].str.split('[').tolist(), columns=["NAICS","code"])
# regex=False treats ']' as a literal character in every pandas version
code3 = pd.DataFrame(code3['code'].str.replace(']', '', regex=False))

In [35]:
# Assign by position: df_3_NAICS keeps the per-file index after pd.concat (duplicate labels),
# while code3 is numbered 0..n-1, so an index-aligned assignment attaches the wrong codes
df_3_NAICS["code"] = code3["code"].to_numpy()

In [36]:
df_3_NAICS

Unnamed: 0,SYEAR,SMTH,NAICS,_EMPLOYMENT_,code
0,2000,1,Aboriginal public administration[914],500,914
1,2000,1,Accommodation services[721],33750,721
2,2000,1,Administrative and support services[561],55250,561
3,2000,1,Air transportation[481],17500,481
4,2000,1,Ambulatory health care services[621],53000,621
...,...,...,...,...,...
3703,1999,12,Utilities[221],10000,221
3704,1999,12,Warehousing and storage[493],4500,493
3705,1999,12,Waste management and remediation services[562],4500,562
3706,1999,12,Water transportation[483],6750,483


In [37]:
# Get the LMO_Detailed_Industry for the 3-digit NAICS RTRA file
df_3_NAICS["LMO_Detailed_Industry"] = df_3_NAICS.apply(add_lmo_industry, axis=1)
df_3_NAICS.head()

Unnamed: 0,SYEAR,SMTH,NAICS,_EMPLOYMENT_,code,LMO_Detailed_Industry
0,2000,1,Aboriginal public administration[914],500,914,Local and Indigenous public administration
1,2000,1,Accommodation services[721],33750,721,Accommodation services
2,2000,1,Administrative and support services[561],55250,561,
3,2000,1,Air transportation[481],17500,481,Air transportation
4,2000,1,Ambulatory health care services[621],53000,621,Ambulatory health care services


### d- Loading and exploring the 4-Digit NAICS Industries data:

In [38]:
# Get the data of 4-digit NAICS industries
df_4_NAICS = pd.concat(map(pd.read_csv, ['Data/RTRA_Employ_4NAICS_00_05.csv','Data/RTRA_Employ_4NAICS_06_10.csv', 
                                         'Data/RTRA_Employ_4NAICS_11_15.csv','Data/RTRA_Employ_4NAICS_16_20.csv',
                                         'Data/RTRA_Employ_4NAICS_97_99.csv']))

In [39]:
df_4_NAICS.head()

Unnamed: 0,SYEAR,SMTH,NAICS,_EMPLOYMENT_
0,2000,1,1100,500
1,2000,1,1111,0
2,2000,1,1112,2000
3,2000,1,1113,250
4,2000,1,1114,7750


In [40]:
# The 4-digit files hold only the numeric code, so copy it as a string
# to match the string codes stored in list_NAICS
df_4_NAICS['code'] = df_4_NAICS['NAICS'].astype(str)

In [41]:
df_4_NAICS.head()

Unnamed: 0,SYEAR,SMTH,NAICS,_EMPLOYMENT_,code
0,2000,1,1100,500,1100
1,2000,1,1111,0,1111
2,2000,1,1112,2000,1112
3,2000,1,1113,250,1113
4,2000,1,1114,7750,1114


In [42]:
# Get the LMO_Detailed_Industry for the 4-digits NAICS RTRA file
df_4_NAICS["LMO_Detailed_Industry"] = df_4_NAICS.apply(add_lmo_industry, axis=1)
df_4_NAICS.head()

Unnamed: 0,SYEAR,SMTH,NAICS,_EMPLOYMENT_,code,LMO_Detailed_Industry
0,2000,1,1100,500,1100,
1,2000,1,1111,0,1111,
2,2000,1,1112,2000,1112,
3,2000,1,1113,250,1113,
4,2000,1,1114,7750,1114,


## Calculate Industry-wise Employment Summary

In [None]:
cols = ["SYEAR", "SMTH", "LMO_Detailed_Industry", "_EMPLOYMENT_"]

# Create a single dataframe with the year, month, LMO industry and employment columns
# from the 2-, 3- and 4-digit NAICS data
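
The cell above stops before the combination step. One way it might continue, offered only as a sketch, is to stack the selected columns from each level, drop rows without an LMO industry, and sum employment per year, month and industry. The function name `summarize_employment` is hypothetical. The demo frames are stand-ins: the Construction figure comes from the output shown earlier, while the Farms values are made up for illustration.

```python
import pandas as pd

cols = ["SYEAR", "SMTH", "LMO_Detailed_Industry", "_EMPLOYMENT_"]


def summarize_employment(frames, cols):
    """Stack the mapped RTRA frames and total employment per month and LMO industry."""
    combined = pd.concat([frame[cols] for frame in frames], ignore_index=True)
    # Rows whose code is not part of any LMO definition carry no industry name.
    combined = combined.dropna(subset=["LMO_Detailed_Industry"])
    keys = ["SYEAR", "SMTH", "LMO_Detailed_Industry"]
    return combined.groupby(keys, as_index=False)["_EMPLOYMENT_"].sum()


# Stand-in frames; the Farms employment values are illustrative, not real data.
two_digit = pd.DataFrame({"SYEAR": [2000], "SMTH": [1],
                          "LMO_Detailed_Industry": ["Construction"],
                          "_EMPLOYMENT_": [106250]})
three_digit = pd.DataFrame({"SYEAR": [2000, 2000, 2000], "SMTH": [1, 1, 1],
                            "LMO_Detailed_Industry": ["Farms", "Farms", None],
                            "_EMPLOYMENT_": [10, 20, 55250]})

summary = summarize_employment([two_digit, three_digit], cols)
print(summary)
```

Because an industry such as Farms is defined by more than one NAICS code, the group-by sum is what merges its component codes into a single monthly figure.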
