# Chapter 0 - Data Preparation & SQL Database Setup

Chapter 0 covers the preprocessing component of this project. Each raw data file is reshaped so that it shares a common date index, which allows the prepared tables to be loaded and joined in an SQL database. The database is called 'maryland_economic_database', and its schema can be found under 'reports', along with more information about the data collected and used in this project. Once the tables have been combined in SQL, the results can be exported back into the Jupyter notebook for further use.
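
The round trip described above (tables with a shared date column go into SQL, a joined result comes back into pandas) can be sketched as follows. This is only an illustration: the database engine used by the project is not stated in this chapter, so an in-memory SQLite database is assumed here, and the table names `yearly_car_sales` and `average_annual_pay` are hypothetical. The values are copied from the outputs shown later in this chapter.

```python
import sqlite3
import pandas as pd

# Two small tables that share a 'Date' column (values copied from outputs in this chapter)
sales = pd.DataFrame({
    'Date': ['2002-01-01', '2003-01-01'],
    'New Cars Sold': [402164, 428252],
})
pay = pd.DataFrame({
    'Date': ['2002-01-01', '2003-01-01'],
    'Annual Average Pay': [39382, 40686],
})

# Assumption: an in-memory SQLite database stands in for the project's database
conn = sqlite3.connect(':memory:')
sales.to_sql('yearly_car_sales', conn, index=False)   # hypothetical table name
pay.to_sql('average_annual_pay', conn, index=False)   # hypothetical table name

# Join the tables on the shared date column and read the result back into pandas
query = """
SELECT s."Date", s."New Cars Sold", p."Annual Average Pay"
FROM yearly_car_sales AS s
JOIN average_annual_pay AS p ON s."Date" = p."Date"
ORDER BY s."Date"
"""
combined = pd.read_sql_query(query, conn)
conn.close()
print(combined)
```

The join only works because both tables store the date in the same YYYY-MM-DD form, which is the reason the rest of this chapter standardises every table's date column.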

### 0.1 - Sales Data Preparation

In [1]:
#Import the pandas library as pd for data manipulation
import pandas as pd
#Import the numpy library as np for numerical computations
import numpy as np

In [2]:
#Import Data for Maryland Car Sales from the raw datasets folder
#Save it as a variable so that it can be referred back
sales_data_month = pd.read_csv('/Users/ben_nicholson/Visual_Code_Projects/Personal_Projects/Maryland Car Sales Data/raw_datasets/MVA_Vehicle_Sales_Counts_by_Month_for_Calendar_Year_2002_through_December_2023.csv')
sales_data_month

Unnamed: 0,Year,Month,New,Used,Total Sales New,Total Sales Used
0,2002,JAN,31106,49927,755015820,386481929
1,2002,FEB,27520,50982,664454223,361353242
2,2002,MAR,34225,58794,805666244,419385387
3,2002,APR,36452,59817,846368297,433061150
4,2002,MAY,37359,60577,855005784,442569410
...,...,...,...,...,...,...
259,2023,AUG,25876,52725,1222581892,908454060
260,2023,SEP,23892,45386,1134437699,744676584
261,2023,OCT,23775,45473,1122680147,740582533
262,2023,NOV,22720,42260,1062465105,694190564


The raw data imported above is dataset 1 in the bibliography. It contains the number of new and used cars sold each month, as well as the total value of those sales. The 'Year' and 'Month' columns need to be combined into a single column in the format YYYY-MM-DD. Using pandas datetime, one date column is created, and this date column serves as the shared index across all the data tables. The data runs from January 2002 to December 2023, which gives 22 years of monthly car sales data.

In [3]:
#Use the .str.strip() method to remove leading and trailing white space from each column name in the dataframe
#Save the columns with this updated formatting
sales_data_month.columns = sales_data_month.columns.str.strip()
sales_data_month.columns

Index(['Year', 'Month', 'New', 'Used', 'Total Sales New', 'Total Sales Used'], dtype='object')

In [4]:
#Create a column named 'Day' which starts at the beginning of every month (1.0)
#Creating a NumPy array where the size is the number of observations in the sales_data_month dataframe
Day = np.ones(sales_data_month.shape[0])

#Insert the column 'Day' to the right of 'Month' to improve logic of column order
sales_data_month.insert(2, 'Day', Day)
sales_data_month

Unnamed: 0,Year,Month,Day,New,Used,Total Sales New,Total Sales Used
0,2002,JAN,1.0,31106,49927,755015820,386481929
1,2002,FEB,1.0,27520,50982,664454223,361353242
2,2002,MAR,1.0,34225,58794,805666244,419385387
3,2002,APR,1.0,36452,59817,846368297,433061150
4,2002,MAY,1.0,37359,60577,855005784,442569410
...,...,...,...,...,...,...,...
259,2023,AUG,1.0,25876,52725,1222581892,908454060
260,2023,SEP,1.0,23892,45386,1134437699,744676584
261,2023,OCT,1.0,23775,45473,1122680147,740582533
262,2023,NOV,1.0,22720,42260,1062465105,694190564


In [5]:
#Column 'Day' has been added. 
#Convert column 'Month' from month abbreviations to numbers.
#Map JAN, FEB, ..., DEC to 1, 2, ..., 12
#Uppercase the values in 'Month' before mapping to avoid case sensitivity errors
numericalMonth = {'JAN':1, 'FEB':2, 'MAR':3, 'APR':4, 'MAY':5, 'JUN':6, 'JUL':7, 'AUG':8, 'SEP':9,
       'OCT':10, 'NOV':11, 'DEC':12}
sales_data_month['Month'] = sales_data_month['Month'].str.upper().map(numericalMonth)
sales_data_month

Unnamed: 0,Year,Month,Day,New,Used,Total Sales New,Total Sales Used
0,2002,1,1.0,31106,49927,755015820,386481929
1,2002,2,1.0,27520,50982,664454223,361353242
2,2002,3,1.0,34225,58794,805666244,419385387
3,2002,4,1.0,36452,59817,846368297,433061150
4,2002,5,1.0,37359,60577,855005784,442569410
...,...,...,...,...,...,...,...
259,2023,8,1.0,25876,52725,1222581892,908454060
260,2023,9,1.0,23892,45386,1134437699,744676584
261,2023,10,1.0,23775,45473,1122680147,740582533
262,2023,11,1.0,22720,42260,1062465105,694190564


Observe that there are now year, month and day columns, which is what pandas needs to build a datetime. This is going to be very useful when creating time series graphs, as well as for interpreting each value as a date.
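
As an aside, pandas can also parse a year and a month abbreviation in a single step, without a separate 'Day' column or a month lookup table. The sketch below uses a made-up three-row frame (not the project's data) and assumes English month abbreviations, since the `%b` format code is locale dependent.

```python
import pandas as pd

# Toy frame with the same 'Year' and 'Month' layout as the raw sales data
toy = pd.DataFrame({
    'Year': [2002, 2002, 2023],
    'Month': ['JAN', 'FEB', 'DEC'],
})

# Build strings such as '2002-Jan' and parse them with an explicit format;
# '%b' is the abbreviated month name, and the day defaults to the 1st
labels = toy['Year'].astype(str) + '-' + toy['Month'].str.title()
toy['Date'] = pd.to_datetime(labels, format='%Y-%b')
print(toy)
```

Both routes yield the first day of each month. The column-by-column approach used in this chapter is more verbose but makes each intermediate step visible.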

In [6]:
#Create a new 'Date' column by combining columns 'Year', 'Month' and 'Day' 
#This new date column can be used as a timestamp for the time series graph
#Insert it as the first column so that the date leads each row
date = pd.to_datetime(sales_data_month[['Year', 'Month', 'Day']])
sales_data_month.insert(0, 'Date', date)
sales_data_month

Unnamed: 0,Date,Year,Month,Day,New,Used,Total Sales New,Total Sales Used
0,2002-01-01,2002,1,1.0,31106,49927,755015820,386481929
1,2002-02-01,2002,2,1.0,27520,50982,664454223,361353242
2,2002-03-01,2002,3,1.0,34225,58794,805666244,419385387
3,2002-04-01,2002,4,1.0,36452,59817,846368297,433061150
4,2002-05-01,2002,5,1.0,37359,60577,855005784,442569410
...,...,...,...,...,...,...,...,...
259,2023-08-01,2023,8,1.0,25876,52725,1222581892,908454060
260,2023-09-01,2023,9,1.0,23892,45386,1134437699,744676584
261,2023-10-01,2023,10,1.0,23775,45473,1122680147,740582533
262,2023-11-01,2023,11,1.0,22720,42260,1062465105,694190564


This is how we want the table to look. The next step is to drop the columns 'Year', 'Month' and 'Day', as they no longer add any information now that 'Date' exists. Once the table is in its final form, it is saved in the original_car_sales_dataset folder, as this is the table that will be accessed in the database.

In [7]:
#Drop columns 'Year', 'Month' and 'Day' as they are not necessary anymore
sales_data_month.drop(columns=['Year', 'Month', 'Day'], inplace=True)
#View the updated dataframe
sales_data_month

Unnamed: 0,Date,New,Used,Total Sales New,Total Sales Used
0,2002-01-01,31106,49927,755015820,386481929
1,2002-02-01,27520,50982,664454223,361353242
2,2002-03-01,34225,58794,805666244,419385387
3,2002-04-01,36452,59817,846368297,433061150
4,2002-05-01,37359,60577,855005784,442569410
...,...,...,...,...,...
259,2023-08-01,25876,52725,1222581892,908454060
260,2023-09-01,23892,45386,1134437699,744676584
261,2023-10-01,23775,45473,1122680147,740582533
262,2023-11-01,22720,42260,1062465105,694190564


In [8]:
#There are 6 different variables
#The column names of the data are not very clear
#To improve the cohesiveness of the project, certain columns are going to be renamed
#Rename the following columns: 'New' as 'New Cars Sold', 'Used' as 'Used Cars Sold', 'Total Sales New' as 'Value of New Cars Sold', 'Total Sales Used' as 'Value of Used Cars Sold'
sales_data_month.rename(columns={'New':'New Cars Sold', 'Used':'Used Cars Sold', 'Total Sales New': 'Value of New Cars Sold', 'Total Sales Used':'Value of Used Cars Sold'}, inplace=True)
sales_data_month

Unnamed: 0,Date,New Cars Sold,Used Cars Sold,Value of New Cars Sold,Value of Used Cars Sold
0,2002-01-01,31106,49927,755015820,386481929
1,2002-02-01,27520,50982,664454223,361353242
2,2002-03-01,34225,58794,805666244,419385387
3,2002-04-01,36452,59817,846368297,433061150
4,2002-05-01,37359,60577,855005784,442569410
...,...,...,...,...,...
259,2023-08-01,25876,52725,1222581892,908454060
260,2023-09-01,23892,45386,1134437699,744676584
261,2023-10-01,23775,45473,1122680147,740582533
262,2023-11-01,22720,42260,1062465105,694190564


In sales_data_month the date includes a day. This is purely a coding convenience: ignore the day and read each row as the sales for that month. For example, the first entry, 2002-01-01, holds the data for January 2002, so it can be read as January - 2002.
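
If the placeholder day is distracting when displaying the data, the dates can be shown as months without changing what is stored. The following is a small sketch on two example dates; the variable names are illustrative, and the month names produced by `strftime` depend on the locale.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2002-01-01', '2002-02-01']))

# Monthly periods drop the placeholder day entirely
month_periods = dates.dt.to_period('M').astype(str).tolist()
print(month_periods)

# Formatted labels in the 'Month - Year' style used in the text
month_labels = dates.dt.strftime('%B - %Y').tolist()
print(month_labels)
```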

In [9]:
#This data is in the form that is required, so it can be saved to the original_car_sales_dataset folder in maryland_economics_database
sales_data_month.to_csv('/Users/ben_nicholson/Visual_Code_Projects/Personal_Projects/Maryland Car Sales Data/maryland_economics_database/original_car_sales_dataset/monthly_mva_car_sales.csv')

The next step is to convert this data from month-to-month tracking of sales in Maryland to a year-to-year time frame. This is done by adding up the values within each calendar year. Aggregating to yearly totals removes the month-to-month variation, which makes the overall trend in the data easier to see.
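
The yearly aggregation can be sketched on a small made-up table. This version uses `pd.Grouper` with `freq='YS'` (year start), which is one possible way to do it and not necessarily the approach taken in the cells below; it has the side effect of labelling each year with its first day directly.

```python
import pandas as pd

# Toy monthly data: three months in 2002 and two in 2003 (made-up values)
monthly = pd.DataFrame({
    'Date': pd.to_datetime(['2002-01-01', '2002-02-01', '2002-03-01',
                            '2003-01-01', '2003-02-01']),
    'New Cars Sold': [10, 20, 30, 5, 15],
    'Used Cars Sold': [1, 2, 3, 4, 5],
})

# Group by calendar year; freq='YS' labels each group with the first day of its year
yearly = monthly.groupby(pd.Grouper(key='Date', freq='YS')).sum().reset_index()
print(yearly)
```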

In [10]:
#Build sales_data_year from sales_data_month
#The new dataframe contains the same data, summed per calendar year instead of per month.
#The 'Date' column stores day, month and year in the form YYYY-MM-DD.
#The datetime accessor (dt) extracts the year, and groupby sums the rows that share the same year
#numeric_only=True keeps the 'Date' column itself out of the sum, which newer pandas versions may otherwise refuse to add up
sales_data_year = sales_data_month.groupby(sales_data_month['Date'].dt.year).sum(numeric_only=True)

sales_data_year

Unnamed: 0_level_0,New Cars Sold,Used Cars Sold,Value of New Cars Sold,Value of Used Cars Sold
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2002,402164,656089,9572554876,4940209772
2003,428252,675358,10623148339,5166071497
2004,428508,699677,10972956024,5385548947
2005,421834,703835,11006733922,5639803515
2006,399282,696968,10411657206,5865687318
2007,378184,678549,9997917483,5794213869
2008,309159,617885,7975401987,4896167840
2009,248928,608889,6686217914,4510040527
2010,268022,626045,7460915522,5054802303
2011,287669,625728,8331732402,5345074083


In [11]:
#To make sure that these values are a true representation of the year, manually add up the values in 2002 and compare
#If the outcome of the equation below is true then we can assume that all columns have been added appropriately
sales_2002 = sales_data_month['New Cars Sold'].iloc[0:12]
sales_2002.sum()==sales_data_year['New Cars Sold'][2002]

True
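
The single-year check above can be extended to every year and every column. The sketch below does this on made-up data; the frame and column list are illustrative stand-ins for the project's dataframes.

```python
import pandas as pd

# Toy monthly data for two years (made-up values)
monthly = pd.DataFrame({
    'Date': pd.to_datetime(['2002-01-01', '2002-02-01', '2003-01-01', '2003-02-01']),
    'New Cars Sold': [10, 20, 5, 15],
    'Used Cars Sold': [1, 2, 3, 4],
})
yearly = monthly.groupby(monthly['Date'].dt.year).sum(numeric_only=True)

# Recompute every yearly total with a boolean filter and compare it with the grouped result
value_columns = ['New Cars Sold', 'Used Cars Sold']
all_match = all(
    monthly.loc[monthly['Date'].dt.year == year, col].sum() == yearly.loc[year, col]
    for year in yearly.index
    for col in value_columns
)
print(all_match)
```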

In [12]:
#Make sure that the date is a regular column rather than the index, as a date index would complicate later steps.
sales_data_year.reset_index(inplace=True)
#Also make sure the date is in the same YYYY-MM-DD format so that other time series and other software can read it correctly
sales_data_year['Date'] = pd.to_datetime(sales_data_year['Date'], format='%Y')
sales_data_year

Unnamed: 0,Date,New Cars Sold,Used Cars Sold,Value of New Cars Sold,Value of Used Cars Sold
0,2002-01-01,402164,656089,9572554876,4940209772
1,2003-01-01,428252,675358,10623148339,5166071497
2,2004-01-01,428508,699677,10972956024,5385548947
3,2005-01-01,421834,703835,11006733922,5639803515
4,2006-01-01,399282,696968,10411657206,5865687318
5,2007-01-01,378184,678549,9997917483,5794213869
6,2008-01-01,309159,617885,7975401987,4896167840
7,2009-01-01,248928,608889,6686217914,4510040527
8,2010-01-01,268022,626045,7460915522,5054802303
9,2011-01-01,287669,625728,8331732402,5345074083


In [13]:
#This data is in the form that is required, so it can be saved to the original_car_sales_dataset folder in maryland_economics_database
sales_data_year.to_csv('/Users/ben_nicholson/Visual_Code_Projects/Personal_Projects/Maryland Car Sales Data/maryland_economics_database/original_car_sales_dataset/yearly_mva_car_sales.csv')

The dates in sales_data_year need the same care as those in sales_data_month. Here, ignore both the month and the day in the 'Date' column. For example, the first row combines the sales from January to December 2002, so it can be read as sales for 2002 rather than as 2002-01-01. Again, this is a coding convention that simplifies later analysis of the data.

### 0.2 - Economics Data Preparation

In [14]:
#Import Data for Average Annual Pay from the raw datasets folder
average_annual_pay = pd.read_excel('/Users/ben_nicholson/Visual_Code_Projects/Personal_Projects/Maryland Car Sales Data/raw_datasets/average_annual_pay.xlsx')
average_annual_pay

Unnamed: 0,Year,Annual
0,2002,39382
1,2003,40686
2,2004,42579
3,2005,44368
4,2006,46162
5,2007,48241
6,2008,49535
7,2009,50579
8,2010,51739
9,2011,53008


In [15]:
#Convert 'Year' to a datetime in the format YYYY-MM-DD (1 January of each year).
average_annual_pay['Date'] = pd.to_datetime(average_annual_pay['Year'], format='%Y')
average_annual_pay.drop(columns='Year',inplace=True)
average_annual_pay.rename(columns={'Annual': 'Annual Average Pay'}, inplace=True)
average_annual_pay

Unnamed: 0,Annual Average Pay,Date
0,39382,2002-01-01
1,40686,2003-01-01
2,42579,2004-01-01
3,44368,2005-01-01
4,46162,2006-01-01
5,48241,2007-01-01
6,49535,2008-01-01
7,50579,2009-01-01
8,51739,2010-01-01
9,53008,2011-01-01


In [16]:
#This data is in the form that is required, so it can be saved to the precombined_economic/yearly datasets folder in maryland_economics_database
average_annual_pay.to_csv('/Users/ben_nicholson/Visual_Code_Projects/Personal_Projects/Maryland Car Sales Data/maryland_economics_database/precombined_economic/yearly datasets/average_annual_pay.csv')

In [17]:
#Import Data for Average Weekly Wage from the raw datasets folder
average_weekly_wage = pd.read_excel('/Users/ben_nicholson/Visual_Code_Projects/Personal_Projects/Maryland Car Sales Data/raw_datasets/average_weekly_wage.xlsx')
average_weekly_wage

Unnamed: 0,Year,Qtr1,Qtr2,Qtr3,Qtr4,Annual
0,2002,757,738,733,802,757
1,2003,775,760,763,831,782
2,2004,815,786,794,879,819
3,2005,831,817,854,910,853
4,2006,898,854,857,942,888
5,2007,937,895,892,986,928
6,2008,963,920,919,1009,953
7,2009,963,934,941,1052,973
8,2010,975,957,966,1080,995
9,2011,1010,985,1023,1059,1019


The data has quarter 1, quarter 2, quarter 3 and quarter 4 as separate columns, plus a column holding the annual value. In the rows shown, the annual value is approximately the mean of the four quarters (the sum of the quarters divided by four). Only the annual value is going to be added to the database. Drop the other columns and change 'Year' to 'Date' so that the table can be joined to the yearly economic data for Maryland.
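
The relationship between the quarterly columns and the annual column can be checked directly. This sketch uses the first three rows of the weekly wage table shown above; the 'Difference' column is added here purely for illustration.

```python
import pandas as pd

# First three rows of the weekly wage table shown above
wages = pd.DataFrame({
    'Year': [2002, 2003, 2004],
    'Qtr1': [757, 775, 815],
    'Qtr2': [738, 760, 786],
    'Qtr3': [733, 763, 794],
    'Qtr4': [802, 831, 879],
    'Annual': [757, 782, 819],
})

# Mean of the four quarters for each row, compared with the published annual figure
quarter_mean = wages[['Qtr1', 'Qtr2', 'Qtr3', 'Qtr4']].mean(axis=1)
wages['Difference'] = wages['Annual'] - quarter_mean
print(wages[['Year', 'Annual', 'Difference']])
```

For these rows the annual figure stays within one unit of the quarterly mean, so the small gaps look like rounding in the source rather than a different calculation.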

In [18]:
#Drop Qtr1, Qtr2, Qtr3, Qtr4
average_weekly_wage.drop(columns={'Qtr1', 'Qtr2', 'Qtr3', 'Qtr4'},inplace=True)
average_weekly_wage

Unnamed: 0,Year,Annual
0,2002,757
1,2003,782
2,2004,819
3,2005,853
4,2006,888
5,2007,928
6,2008,953
7,2009,973
8,2010,995
9,2011,1019


In [19]:
#Convert 'Year' to a datetime in the format YYYY-MM-DD (1 January of each year).
average_weekly_wage['Date'] = pd.to_datetime(average_weekly_wage['Year'], format='%Y')
average_weekly_wage.drop(columns='Year',inplace=True)
average_weekly_wage.rename(columns={'Annual': 'Average Weekly Wage'}, inplace=True)
average_weekly_wage

Unnamed: 0,Average Weekly Wage,Date
0,757,2002-01-01
1,782,2003-01-01
2,819,2004-01-01
3,853,2005-01-01
4,888,2006-01-01
5,928,2007-01-01
6,953,2008-01-01
7,973,2009-01-01
8,995,2010-01-01
9,1019,2011-01-01


In [20]:
#This data is in the form that is required, so it can be saved to the precombined_economic/yearly datasets folder in maryland_economics_database
average_weekly_wage.to_csv('/Users/ben_nicholson/Visual_Code_Projects/Personal_Projects/Maryland Car Sales Data/maryland_economics_database/precombined_economic/yearly datasets/average_weekly_wage.csv')

In [21]:
#Import Data for Gross Domestic Product (GDP) in the state of Maryland from the raw datasets folder
gdp_state_maryland = pd.read_csv('/Users/ben_nicholson/Visual_Code_Projects/Personal_Projects/Maryland Car Sales Data/raw_datasets/gdp_state_maryland_(dollars).csv')
gdp_state_maryland

Unnamed: 0,Years,Maryland
0,2022,470187100000.0
1,2021,443929900000.0
2,2020,410931000000.0
3,2019,420371300000.0
4,2018,410812200000.0
5,2017,400157000000.0
6,2016,387733400000.0
7,2015,369728100000.0
8,2014,352761300000.0
9,2013,340577600000.0


In [22]:
#Sort the rows by 'Years' so that the table starts with the earliest year and increases to the latest.
gdp_state_maryland = gdp_state_maryland.sort_values(by = 'Years')
gdp_state_maryland

Unnamed: 0,Years,Maryland
25,1997,159372100000.0
24,1998,169982100000.0
23,1999,180261600000.0
22,2000,192021400000.0
21,2001,205981000000.0
20,2002,217837100000.0
19,2003,228959300000.0
18,2004,245124500000.0
17,2005,262100000000.0
16,2006,274145300000.0


In [23]:
#Convert 'Years' to a datetime in the format YYYY-MM-DD (1 January of each year).
gdp_state_maryland['Date'] = pd.to_datetime(gdp_state_maryland['Years'], format='%Y')
gdp_state_maryland.drop(columns='Years',inplace=True)
gdp_state_maryland.rename(columns={'Maryland': 'GDP for Maryland State'}, inplace=True)
gdp_state_maryland

Unnamed: 0,GDP for Maryland State,Date
25,159372100000.0,1997-01-01
24,169982100000.0,1998-01-01
23,180261600000.0,1999-01-01
22,192021400000.0,2000-01-01
21,205981000000.0,2001-01-01
20,217837100000.0,2002-01-01
19,228959300000.0,2003-01-01
18,245124500000.0,2004-01-01
17,262100000000.0,2005-01-01
16,274145300000.0,2006-01-01


In [24]:
#Need to remove entries 1997-2001 as this time period will not be used when looking at car sales
#First define the start and the end date of where you want your dataframe to exist
start_date = '2002-01-01'
end_date = '2022-01-01'

#Create a boolean mask based on the date range
mask = (gdp_state_maryland['Date'] >= start_date) & (gdp_state_maryland['Date'] <= end_date)

#Apply the mask to filter the rows
gdp_state_maryland = gdp_state_maryland.loc[mask]
gdp_state_maryland

Unnamed: 0,GDP for Maryland State,Date
20,217837100000.0,2002-01-01
19,228959300000.0,2003-01-01
18,245124500000.0,2004-01-01
17,262100000000.0,2005-01-01
16,274145300000.0,2006-01-01
15,282965600000.0,2007-01-01
14,295400200000.0,2008-01-01
13,299102700000.0,2009-01-01
12,314728300000.0,2010-01-01
11,326281200000.0,2011-01-01
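
The date filter in cell 24 builds a boolean mask from two comparisons. An equivalent and slightly shorter form uses `Series.between`, which is inclusive at both ends by default. The sketch below applies it to four rows copied from the GDP output above; the frame name `gdp` is illustrative.

```python
import pandas as pd

# Four rows copied from the sorted GDP table above
gdp = pd.DataFrame({
    'Date': pd.to_datetime(['2000-01-01', '2001-01-01', '2002-01-01', '2003-01-01']),
    'GDP for Maryland State': [192021400000.0, 205981000000.0,
                               217837100000.0, 228959300000.0],
})

# Keep only the rows whose date lies in the inclusive range
filtered = gdp[gdp['Date'].between('2002-01-01', '2022-01-01')]
print(filtered)
```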


In [25]:
#This data is in the form that is required, so it can be saved to the precombined_economic/yearly datasets folder in maryland_economics_database
gdp_state_maryland.to_csv('/Users/ben_nicholson/Visual_Code_Projects/Personal_Projects/Maryland Car Sales Data/maryland_economics_database/precombined_economic/yearly datasets/gdp_state_maryland.csv')

In [26]:
#Import Data for Yearly Inflation Rates (U.S. city average CPI-U, all items less food and energy) from the raw datasets folder
yearly_inflation_rate = pd.read_excel('/Users/ben_nicholson/Visual_Code_Projects/Personal_Projects/Maryland Car Sales Data/raw_datasets/yearly_inflation_rate_usa.xlsx')
yearly_inflation_rate


Unnamed: 0,Consumer Price Index for All Urban Consumers (CPI-U),Unnamed: 1
0,12-Month Percent Change,
1,,
2,Series Id:,CUUR0000SA0L1E
3,Not Seasonally Adjusted,
4,Series Title:,All items less food and energy in U.S. city av...
5,Area:,U.S. city average
6,Item:,All items less food and energy
7,Base Period:,1982-84=100
8,Years:,2002 to 2023
9,,


In [27]:
#The first 11 rows are information about the data, so drop them.
#.copy() makes the result an independent dataframe, which avoids pandas' SettingWithCopyWarning in later cells
yearly_inflation_rate = yearly_inflation_rate.iloc[11:].copy()
#Reset the index and discard the old one
yearly_inflation_rate.reset_index(drop=True, inplace=True)
yearly_inflation_rate


Unnamed: 0,Consumer Price Index for All Urban Consumers (CPI-U),Unnamed: 1
0,2002,2.4
1,2003,1.4
2,2004,1.8
3,2005,2.2
4,2006,2.5
5,2007,2.3
6,2008,2.3
7,2009,1.7
8,2010,1.0
9,2011,1.7


In [28]:
#Change the name of columns
yearly_inflation_rate_columns = ['Year', 'Inflation Rate']
yearly_inflation_rate.columns = yearly_inflation_rate_columns
yearly_inflation_rate

Unnamed: 0,Year,Inflation Rate
0,2002,2.4
1,2003,1.4
2,2004,1.8
3,2005,2.2
4,2006,2.5
5,2007,2.3
6,2008,2.3
7,2009,1.7
8,2010,1.0
9,2011,1.7


In [29]:
#Convert 'Year' to a datetime in the format YYYY-MM-DD (1 January of each year), then drop 'Year'
#assign() and drop() return new dataframes, which avoids pandas' SettingWithCopyWarning
yearly_inflation_rate = yearly_inflation_rate.assign(Date=pd.to_datetime(yearly_inflation_rate['Year'], format='%Y')).drop(columns='Year')
yearly_inflation_rate


Unnamed: 0,Inflation Rate,Date
0,2.4,2002-01-01
1,1.4,2003-01-01
2,1.8,2004-01-01
3,2.2,2005-01-01
4,2.5,2006-01-01
5,2.3,2007-01-01
6,2.3,2008-01-01
7,1.7,2009-01-01
8,1.0,2010-01-01
9,1.7,2011-01-01


In [30]:
#This data is in the form that is required, so it can be saved to the precombined_economic/yearly datasets folder in maryland_economics_database
yearly_inflation_rate.to_csv('/Users/ben_nicholson/Visual_Code_Projects/Personal_Projects/Maryland Car Sales Data/maryland_economics_database/precombined_economic/yearly datasets/yearly_inflation_rate_usa.csv')

In [31]:
#Import Data for population in Maryland from the raw datasets folder
maryland_population = pd.read_csv('/Users/ben_nicholson/Visual_Code_Projects/Personal_Projects/Maryland Car Sales Data/raw_datasets/maryland-population-2024-02-16.csv')
maryland_population

Unnamed: 0,date,Population,Annual Change
0,1900-12-01,1189000,
1,1901-12-01,1200000,0.93
2,1902-12-01,1210000,0.83
3,1903-12-01,1209000,-0.08
4,1904-12-01,1217000,0.66
...,...,...,...
119,2019-12-01,6054954,0.21
120,2020-12-01,6173689,1.96
121,2021-12-01,6175045,0.02
122,2022-12-01,6163981,-0.18


In [32]:
#Shift each date so that it falls on the first of January of the year the population figure refers to
maryland_population['date'] = pd.to_datetime(maryland_population['date']) + pd.DateOffset(months=1) - pd.DateOffset(years=1)
maryland_population

Unnamed: 0,date,Population,Annual Change
0,1900-01-01,1189000,
1,1901-01-01,1200000,0.93
2,1902-01-01,1210000,0.83
3,1903-01-01,1209000,-0.08
4,1904-01-01,1217000,0.66
...,...,...,...
119,2019-01-01,6054954,0.21
120,2020-01-01,6173689,1.96
121,2021-01-01,6175045,0.02
122,2022-01-01,6163981,-0.18
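
The offset arithmetic in cell 32 moves each December date forward one month and back one year. Another way to land on 1 January of the same year, shown here only as a sketch on two example dates, is to convert to yearly periods and back to timestamps.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['1900-12-01', '2022-12-01']))

# Convert to yearly periods, then back to timestamps at the start of each period
year_start = dates.dt.to_period('Y').dt.to_timestamp()
print(year_start.tolist())
```

This form does not depend on the source dates falling in December, which the month and year offsets implicitly assume.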


In [33]:
#Copy 'date' into a new 'Date' column
#Drop the old 'date' column and the ' Annual Change' column (the raw headers carry a leading space)
maryland_population['Date'] = maryland_population['date']
maryland_population.drop(columns=[' Annual Change', 'date'], inplace=True)
#Note: the population column is still named ' Population' (with a leading space) at this point
maryland_population

Unnamed: 0,Population,Date
0,1189000,1900-01-01
1,1200000,1901-01-01
2,1210000,1902-01-01
3,1209000,1903-01-01
4,1217000,1904-01-01
...,...,...
119,6054954,2019-01-01
120,6173689,2020-01-01
121,6175045,2021-01-01
122,6163981,2022-01-01


In [34]:
#Only keep the years from 2002-2023
#.copy() makes the slice an independent dataframe
maryland_population = maryland_population.iloc[102:124].copy()
maryland_population

Unnamed: 0,Population,Date
102,5440389,2002-01-01
103,5496269,2003-01-01
104,5546935,2004-01-01
105,5592379,2005-01-01
106,5627367,2006-01-01
107,5653408,2007-01-01
108,5684965,2008-01-01
109,5730388,2009-01-01
110,5788784,2010-01-01
111,5840241,2011-01-01


In [35]:
#Reset the index, discarding the old one
maryland_population = maryland_population.reset_index(drop=True)
#The raw header has a leading space (' Population'); rename it and put 'Date' first
maryland_population = maryland_population.rename(columns={' Population': 'Population'})[['Date', 'Population']]
maryland_population


Unnamed: 0,Date,Population
0,2002-01-01,5440389
1,2003-01-01,5496269
2,2004-01-01,5546935
3,2005-01-01,5592379
4,2006-01-01,5627367
5,2007-01-01,5653408
6,2008-01-01,5684965
7,2009-01-01,5730388
8,2010-01-01,5788784
9,2011-01-01,5840241


In [36]:
#This data is in the form that is required, so it can be saved to the precombined_economic/yearly datasets folder in maryland_economics_database
maryland_population.to_csv('/Users/ben_nicholson/Visual_Code_Projects/Personal_Projects/Maryland Car Sales Data/maryland_economics_database/precombined_economic/yearly datasets/maryland_population.csv')

In [37]:
#Import Data for total wages (in thousands) in Maryland from the raw datasets folder
total_wages_thousands = pd.read_excel('/Users/ben_nicholson/Visual_Code_Projects/Personal_Projects/Maryland Car Sales Data/raw_datasets/Total_Wages_(in_thousands).xlsx')
total_wages_thousands

Unnamed: 0,Year,Qtr1,Qtr2,Qtr3,Qtr4,Annual
0,2002,23526501,23432344,23153527,25477137,95589510
1,2003,24058561,24219663,24211105,26550627,99039956
2,2004,25526995,25252798,25390468,28546250,104716511
3,2005,26398381,26671481,27803944,29934119,110807924
4,2006,29033832,28261692,28168918,31326900,116791342
5,2007,30559128,29845307,29496454,32985771,122886661
6,2008,31501211,30679882,30228395,33297097,125706584
7,2009,30629666,30193426,29922480,33734359,124479932
8,2010,30334968,30881430,30800610,34909278,126926286
9,2011,31932406,31959394,32944856,34543144,131379799


In [38]:
#Convert 'Annual' from thousands to single units by multiplying by 1000 (this adds the three trailing zeros)
total_wages_thousands['Annual'] = total_wages_thousands['Annual'] * 1000
total_wages_thousands

Unnamed: 0,Year,Qtr1,Qtr2,Qtr3,Qtr4,Annual
0,2002,23526501,23432344,23153527,25477137,95589510000
1,2003,24058561,24219663,24211105,26550627,99039956000
2,2004,25526995,25252798,25390468,28546250,104716511000
3,2005,26398381,26671481,27803944,29934119,110807924000
4,2006,29033832,28261692,28168918,31326900,116791342000
5,2007,30559128,29845307,29496454,32985771,122886661000
6,2008,31501211,30679882,30228395,33297097,125706584000
7,2009,30629666,30193426,29922480,33734359,124479932000
8,2010,30334968,30881430,30800610,34909278,126926286000
9,2011,31932406,31959394,32944856,34543144,131379799000


In [39]:
#Convert 'Year' to a datetime in the format YYYY-MM-DD (1 January of each year).
total_wages_thousands['Date'] = pd.to_datetime(total_wages_thousands['Year'], format='%Y')
#Drop the quarterly columns as they do not add to the yearly car sales analysis
total_wages_thousands.drop(columns={'Year','Qtr1','Qtr2','Qtr3','Qtr4'},inplace=True)
total_wages_thousands.rename(columns={'Annual': 'Total Wages in Maryland'}, inplace=True)
total_wages_thousands

Unnamed: 0,Total Wages in Maryland,Date
0,95589510000,2002-01-01
1,99039956000,2003-01-01
2,104716511000,2004-01-01
3,110807924000,2005-01-01
4,116791342000,2006-01-01
5,122886661000,2007-01-01
6,125706584000,2008-01-01
7,124479932000,2009-01-01
8,126926286000,2010-01-01
9,131379799000,2011-01-01


In [40]:
#This data is in the form that is required, so it can be saved to the precombined_economic/yearly datasets folder in maryland_economics_database
total_wages_thousands.to_csv('/Users/ben_nicholson/Visual_Code_Projects/Personal_Projects/Maryland Car Sales Data/maryland_economics_database/precombined_economic/yearly datasets/total_wages(thousands).csv')

In [41]:
#Import Data for total workforce in Maryland from the raw datasets folder
total_workforce = pd.read_excel('/Users/ben_nicholson/Visual_Code_Projects/Personal_Projects/Maryland Car Sales Data/raw_datasets/total_workforce.xlsx')
total_workforce

Unnamed: 0,Year,Maryland Total Workforce
0,2001,2423138
1,2002,2427396
2,2003,2434480
3,2004,2461074
4,2005,2497416
5,2006,2530129
6,2007,2546850
7,2008,2537400
8,2009,2460972
9,2010,2454418


In [42]:
#Convert 'Year' to a datetime in the format YYYY-MM-DD (1 January of each year).
total_workforce['Date'] = pd.to_datetime(total_workforce['Year'], format='%Y')
#Do not need to include the year 2001 and its workforce, so skip the first row
#A non-inplace drop() returns a new dataframe, which avoids pandas' SettingWithCopyWarning
total_workforce = total_workforce.iloc[1:].drop(columns='Year')
total_workforce


Unnamed: 0,Maryland Total Workforce,Date
1,2427396,2002-01-01
2,2434480,2003-01-01
3,2461074,2004-01-01
4,2497416,2005-01-01
5,2530129,2006-01-01
6,2546850,2007-01-01
7,2537400,2008-01-01
8,2460972,2009-01-01
9,2454418,2010-01-01
10,2479122,2011-01-01


In [43]:
#This data is in the form that is required, so it can be saved to the precombined_economic/yearly datasets folder in maryland_economics_database
total_workforce.to_csv('/Users/ben_nicholson/Visual_Code_Projects/Personal_Projects/Maryland Car Sales Data/maryland_economics_database/precombined_economic/yearly datasets/total_workforce.csv')

In [44]:
#Import Data for unemployment (not seasonally adjusted) in Maryland from the raw datasets folder
unemployment_not_sa = pd.read_excel('/Users/ben_nicholson/Visual_Code_Projects/Personal_Projects/Maryland Car Sales Data/raw_datasets/unemployment_not_seasonally_adjusted.xlsx')
unemployment_not_sa

Unnamed: 0,Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
0,2002,5.5,5.2,5.1,4.9,4.7,5.1,4.8,4.7,4.3,4.3,4.5,4.3
1,2003,5.0,5.1,5.0,4.6,4.7,5.3,5.0,4.8,4.5,4.5,4.6,4.4
2,2004,5.0,4.9,4.8,4.2,4.4,4.9,4.9,4.6,4.3,4.4,4.5,4.4
3,2005,5.0,5.1,4.8,4.3,4.5,4.8,4.7,4.4,4.1,4.1,4.2,3.8
4,2006,4.3,4.3,3.9,3.9,4.2,4.6,4.7,4.4,4.1,3.9,4.0,3.8
5,2007,4.4,4.2,3.7,3.4,3.5,4.0,4.0,3.7,3.5,3.5,3.3,3.3
6,2008,3.9,3.8,3.8,3.5,4.0,4.6,4.7,4.8,4.7,5.1,5.4,5.8
7,2009,7.1,7.5,7.5,7.0,7.5,8.0,7.9,7.7,7.5,7.7,7.6,7.6
8,2010,8.5,8.6,8.4,7.5,7.5,7.9,8.1,8.0,7.5,7.4,7.6,7.2
9,2011,7.9,7.8,7.4,6.9,7.2,7.7,7.6,7.5,7.2,7.0,6.7,6.7


This is the first dataset that is going to be added to the monthly table. The rows are the years from 2002 to 2023 and the columns are the months from January to December. The data needs to be reshaped so that each month becomes its own row.
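
The wide-to-long reshape can be sketched compactly on a small slice of the table shown above (two years, two months). The names `wide` and `long_form` are illustrative, and parsing with `%b` assumes English month abbreviations.

```python
import pandas as pd

# Two years and two months copied from the unemployment table shown above
wide = pd.DataFrame({
    'Year': [2002, 2003],
    'Jan': [5.5, 5.0],
    'Feb': [5.2, 5.1],
})

# melt() stacks the month columns into (Year, Month, value) rows
long_form = wide.melt(id_vars='Year', var_name='Month', value_name='Unemployment rate')

# Parse 'YYYY-Mon' strings into dates, then sort chronologically
long_form['Date'] = pd.to_datetime(
    long_form['Year'].astype(str) + '-' + long_form['Month'], format='%Y-%b'
)
long_form = long_form.sort_values('Date').reset_index(drop=True)[['Date', 'Unemployment rate']]
print(long_form)
```

Note that `melt` returns the rows grouped by month (every January first), which is why the sort by date is needed before the table lines up with the monthly sales data.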

In [45]:
#Create a list of the column months that are going to be iterated through
months = ['Jan', 'Feb', 'Mar', 'Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']

In [46]:
#Use the .melt() function so that you can convert the columns to a singular column based on the month of the value
unemployment_not_sa = unemployment_not_sa.melt(id_vars=['Year'], var_name='month', value_vars=months)
unemployment_not_sa

Unnamed: 0,Year,month,value
0,2002,Jan,5.5
1,2003,Jan,5.0
2,2004,Jan,5.0
3,2005,Jan,5.0
4,2006,Jan,4.3
...,...,...,...
259,2019,Dec,2.9
260,2020,Dec,6.3
261,2021,Dec,3.2
262,2022,Dec,2.5


In [47]:
#Create a column named 'Day' which starts at the beginning of every month (1.0)
#Creating a NumPy array where the size is the number of observations in the unemployment_not_sa dataframe
unemployment_not_sa['Day']= np.ones(unemployment_not_sa.shape[0])

#The 'Day' column is appended as the last column
unemployment_not_sa

Unnamed: 0,Year,month,value,Day
0,2002,Jan,5.5,1.0
1,2003,Jan,5.0,1.0
2,2004,Jan,5.0,1.0
3,2005,Jan,5.0,1.0
4,2006,Jan,4.3,1.0
...,...,...,...,...
259,2019,Dec,2.9,1.0
260,2020,Dec,6.3,1.0
261,2021,Dec,3.2,1.0
262,2022,Dec,2.5,1.0


In [48]:
#Column 'Day' has been added.
#Convert the 'month' column into numerical values.
#Map the month abbreviations Jan, Feb, ..., Dec to 1, 2, ..., 12
#The dictionary keys use the same capitalization as the melted column names ('Jan', not 'JAN')
numericalMonth = {'Jan':1, 'Feb':2, 'Mar':3, 'Apr':4, 'May':5, 'Jun':6, 'Jul':7, 'Aug':8, 'Sep':9,
       'Oct':10, 'Nov':11, 'Dec':12}
unemployment_not_sa['Month'] = unemployment_not_sa['month'].map(numericalMonth)
unemployment_not_sa

Unnamed: 0,Year,month,value,Day,Month
0,2002,Jan,5.5,1.0,1
1,2003,Jan,5.0,1.0,1
2,2004,Jan,5.0,1.0,1
3,2005,Jan,5.0,1.0,1
4,2006,Jan,4.3,1.0,1
...,...,...,...,...,...
259,2019,Dec,2.9,1.0,12
260,2020,Dec,6.3,1.0,12
261,2021,Dec,3.2,1.0,12
262,2022,Dec,2.5,1.0,12
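
Because `.map()` only translates exact key matches, a label with different capitalization silently becomes `NaN`. The following is a minimal sketch, assuming hypothetical mixed-case labels (the series `labels` is made up for illustration), of normalizing the case before mapping and listing anything that still fails to match.

```python
import pandas as pd

# Hypothetical month labels in mixed case, as might appear across different raw files
labels = pd.Series(['Jan', 'FEB', 'mar', 'Sept'])

numerical_month = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
                   'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}

# Mapping without normalizing: only the exact match 'Jan' is translated
raw_mapped = labels.map(numerical_month)

# Title-case first so that 'FEB' and 'mar' match the dictionary keys
clean_mapped = labels.str.title().map(numerical_month)

# Any label that is still unmatched (here 'Sept') shows up as NaN and can be listed
unmatched = labels[clean_mapped.isna()].tolist()
print(unmatched)  # ['Sept']
```

Checking for unmatched labels before building the date column avoids carrying missing months forward into the datetime conversion.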


In [49]:
#Create a column named 'Date' that combines the values from 'Year', 'Month' and 'Day' into a datetime
unemployment_not_sa['Date'] = pd.to_datetime(unemployment_not_sa[['Year', 'Month', 'Day']])
unemployment_not_sa

Unnamed: 0,Year,month,value,Day,Month,Date
0,2002,Jan,5.5,1.0,1,2002-01-01
1,2003,Jan,5.0,1.0,1,2003-01-01
2,2004,Jan,5.0,1.0,1,2004-01-01
3,2005,Jan,5.0,1.0,1,2005-01-01
4,2006,Jan,4.3,1.0,1,2006-01-01
...,...,...,...,...,...,...
259,2019,Dec,2.9,1.0,12,2019-12-01
260,2020,Dec,6.3,1.0,12,2020-12-01
261,2021,Dec,3.2,1.0,12,2021-12-01
262,2022,Dec,2.5,1.0,12,2022-12-01


In [50]:
#The melted dataframe lists every January first, then every February, and so on, instead of in chronological order
#Use the sort_values method to fix this issue
#Then drop the helper columns that are no longer necessary, reset the index and rename 'value'
unemployment_not_sa.sort_values(by='Date',inplace=True)
unemployment_not_sa.drop(columns=['Year', 'month', 'Day', 'Month'],inplace=True)
unemployment_not_sa.reset_index(inplace=True)
unemployment_not_sa.drop(columns='index',inplace=True)
unemployment_not_sa.rename(columns={'value': 'Unemployment rate (not sa)'}, inplace=True)
unemployment_not_sa

Unnamed: 0,Unemployment rate (not sa),Date
0,5.5,2002-01-01
1,5.2,2002-02-01
2,5.1,2002-03-01
3,4.9,2002-04-01
4,4.7,2002-05-01
...,...,...
259,1.8,2023-08-01
260,1.8,2023-09-01
261,2.1,2023-10-01
262,1.8,2023-11-01


In [51]:
#This data is in the required form, so it can be saved to the 'monthly datasets' folder under precombined_economic in maryland_economics_database
unemployment_not_sa.to_csv('/Users/ben_nicholson/Visual_Code_Projects/Personal_Projects/Maryland Car Sales Data/maryland_economics_database/precombined_economic/monthly datasets/unemployment_not_seasonally_adjusted.csv')
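
The loops that combine the files later in this chapter drop a column named `Unnamed: 0`. That column comes from `to_csv` writing the row index by default. The sketch below uses a made-up two-row frame and an in-memory buffer instead of a real file path to show the round trip with and without `index=False`.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the prepared unemployment data
df = pd.DataFrame({'Unemployment rate (not sa)': [5.5, 5.2],
                   'Date': pd.to_datetime(['2002-01-01', '2002-02-01'])})

# Default to_csv writes the row index as an unnamed first column
with_index = pd.read_csv(io.StringIO(df.to_csv()))
print(list(with_index.columns))     # ['Unnamed: 0', 'Unemployment rate (not sa)', 'Date']

# index=False leaves the row index out of the file
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))
print(list(without_index.columns))  # ['Unemployment rate (not sa)', 'Date']
```

If the files were saved with `index=False`, the later `drop(columns='Unnamed: 0')` step would have to be removed, so the two choices need to stay consistent.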

That was a long process that may need to be repeated for other datasets. To avoid duplicating code and losing time, the steps above are wrapped in a function so that they can be reused later. The function is defined below.

In [52]:
def convert_columns_to_rows(data, months):
    """Reshape a table with one row per year and one column per month into one row per month."""
    #Melt the DataFrame so that each month column becomes its own row
    data = data.melt(id_vars=['Year'], var_name='month', value_vars=months)

    #Create a column named 'Day' set to 1.0 so that every date falls on the first of the month
    data['Day'] = np.ones(data.shape[0])

    #Map the month abbreviations in 'month' to the numbers 1 to 12
    numericalMonth = {'Jan':1, 'Feb':2, 'Mar':3, 'Apr':4, 'May':5, 'Jun':6, 'Jul':7, 'Aug':8, 'Sep':9,
           'Oct':10, 'Nov':11, 'Dec':12}
    data['Month'] = data['month'].map(numericalMonth)

    #Combine year, month and day into a datetime column and sort chronologically
    data['Date'] = pd.to_datetime(data[['Year', 'Month', 'Day']])
    data.sort_values(by='Date', inplace=True)

    #Drop the helper columns and reset the index
    data.drop(columns=['Year', 'month', 'Day', 'Month'], inplace=True)
    data.reset_index(inplace=True, drop=True)

    return data
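
As an alternative sketch, not the approach used in this notebook, the date can be parsed directly from a `'2002-Jan'` style string, which avoids the helper `Day` and `Month` columns. The function name `wide_months_to_long` and the `value_name` parameter are hypothetical, and the `%b` format assumes month abbreviations in the English locale.

```python
import pandas as pd

def wide_months_to_long(data, months, value_name='value'):
    """Hypothetical variant of convert_columns_to_rows that parses dates from text."""
    long = data.melt(id_vars=['Year'], value_vars=months,
                     var_name='month', value_name=value_name)
    # Build strings such as '2002-Jan' and parse them; the day defaults to 1
    long['Date'] = pd.to_datetime(long['Year'].astype(str) + '-' + long['month'],
                                  format='%Y-%b')
    long = long.sort_values('Date').reset_index(drop=True)
    return long[[value_name, 'Date']]

# Hypothetical input in the same wide layout as the raw files
wide = pd.DataFrame({'Year': [2002, 2003], 'Jan': [5.5, 5.0], 'Feb': [5.2, 5.1]})
result = wide_months_to_long(wide, ['Jan', 'Feb'], value_name='Unemployment rate')
print(result['Date'].dt.strftime('%Y-%m-%d').tolist())
# ['2002-01-01', '2002-02-01', '2003-01-01', '2003-02-01']
```

Passing the final column name as an argument would also remove the separate `rename` call after each conversion.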

In [54]:
#Import Data for unemployment in Maryland from the raw datasets folder
unemployment_sa = pd.read_excel('/Users/ben_nicholson/Visual_Code_Projects/Personal_Projects/Maryland Car Sales Data/raw_datasets/unemployment_seasonally_adjusted.xlsx')
unemployment_sa

Unnamed: 0,Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
0,2002,5.0,5.0,5.0,4.9,4.9,4.8,4.7,4.6,4.6,4.6,4.6,4.6
1,2003,4.7,4.8,4.8,4.9,4.9,4.9,4.9,4.8,4.8,4.7,4.7,4.7
2,2004,4.6,4.6,4.6,4.6,4.6,4.6,4.6,4.6,4.6,4.6,4.7,4.7
3,2005,4.7,4.7,4.7,4.7,4.6,4.5,4.4,4.4,4.4,4.3,4.2,4.1
4,2006,4.1,4.0,4.0,4.1,4.2,4.3,4.3,4.3,4.3,4.2,4.2,4.1
5,2007,4.0,3.9,3.8,3.7,3.7,3.7,3.7,3.7,3.7,3.7,3.6,3.6
6,2008,3.6,3.6,3.7,3.8,4.0,4.2,4.5,4.7,5.0,5.3,5.7,6.2
7,2009,6.6,7.0,7.3,7.5,7.6,7.7,7.7,7.7,7.8,7.8,7.9,8
8,2010,8.1,8.1,8.0,7.9,7.9,7.8,7.7,7.7,7.7,7.7,7.7,7.6
9,2011,7.5,7.4,7.4,7.3,7.4,7.4,7.4,7.4,7.3,7.2,7.1,7


In [55]:
#Use the function that was defined above so that it can be in the correct form
unemployment_sa = convert_columns_to_rows(unemployment_sa, months=months)
unemployment_sa.rename(columns={'value': 'Unemployment rate'}, inplace=True)

In [56]:
#This data is in the required form, so it can be saved to the 'monthly datasets' folder under precombined_economic in maryland_economics_database
unemployment_sa.to_csv('/Users/ben_nicholson/Visual_Code_Projects/Personal_Projects/Maryland Car Sales Data/maryland_economics_database/precombined_economic/monthly datasets/unemployment_seasonally_adjusted.csv')

In [57]:
#Import the monthly inflation rate data (monthly_inflation_rate_usa.xlsx) from the raw datasets folder
monthly_inflation_rate = pd.read_excel('/Users/ben_nicholson/Visual_Code_Projects/Personal_Projects/Maryland Car Sales Data/raw_datasets/monthly_inflation_rate_usa.xlsx')
monthly_inflation_rate

Unnamed: 0,Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec,HALF1,HALF2
0,2002,2.6,2.6,2.4,2.5,2.5,2.3,2.2,2.4,2.2,2.2,2.0,1.9,2.5,2.2
1,2003,1.9,1.7,1.7,1.5,1.6,1.5,1.5,1.3,1.2,1.3,1.1,1.1,1.7,1.3
2,2004,1.1,1.2,1.6,1.8,1.7,1.9,1.8,1.7,2.0,2.0,2.2,2.2,1.6,2.0
3,2005,2.3,2.4,2.3,2.2,2.2,2.0,2.1,2.1,2.0,2.1,2.1,2.2,2.2,2.1
4,2006,2.1,2.1,2.1,2.3,2.4,2.6,2.7,2.8,2.9,2.7,2.6,2.6,2.2,2.7
5,2007,2.7,2.7,2.5,2.3,2.2,2.2,2.2,2.1,2.1,2.2,2.3,2.4,2.4,2.3
6,2008,2.5,2.3,2.4,2.3,2.3,2.4,2.5,2.5,2.5,2.2,2.0,1.8,2.3,2.3
7,2009,1.7,1.8,1.8,1.9,1.8,1.7,1.5,1.4,1.5,1.7,1.7,1.8,1.8,1.6
8,2010,1.6,1.3,1.1,0.9,0.9,0.9,0.9,0.9,0.8,0.6,0.8,0.8,1.1,0.8
9,2011,1.0,1.1,1.2,1.3,1.5,1.6,1.8,2.0,2.0,2.1,2.2,2.2,1.3,2.0


In [58]:
#Preview the first rows. The HALF1 and HALF2 columns are not necessary; they are left out when the data is melted with value_vars=months
monthly_inflation_rate.head(6)

Unnamed: 0,Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec,HALF1,HALF2
0,2002,2.6,2.6,2.4,2.5,2.5,2.3,2.2,2.4,2.2,2.2,2.0,1.9,2.5,2.2
1,2003,1.9,1.7,1.7,1.5,1.6,1.5,1.5,1.3,1.2,1.3,1.1,1.1,1.7,1.3
2,2004,1.1,1.2,1.6,1.8,1.7,1.9,1.8,1.7,2.0,2.0,2.2,2.2,1.6,2.0
3,2005,2.3,2.4,2.3,2.2,2.2,2.0,2.1,2.1,2.0,2.1,2.1,2.2,2.2,2.1
4,2006,2.1,2.1,2.1,2.3,2.4,2.6,2.7,2.8,2.9,2.7,2.6,2.6,2.2,2.7
5,2007,2.7,2.7,2.5,2.3,2.2,2.2,2.2,2.1,2.1,2.2,2.3,2.4,2.4,2.3


In [59]:
#Use the function that was defined above to put the data in the correct format
monthly_inflation_rate = convert_columns_to_rows(monthly_inflation_rate, months=months)
monthly_inflation_rate.rename(columns={'value': 'Monthly Inflation Rate'}, inplace=True)
monthly_inflation_rate

Unnamed: 0,Monthly Inflation Rate,Date
0,2.6,2002-01-01
1,2.6,2002-02-01
2,2.4,2002-03-01
3,2.5,2002-04-01
4,2.5,2002-05-01
...,...,...
259,4.3,2023-08-01
260,4.1,2023-09-01
261,4.0,2023-10-01
262,4.0,2023-11-01


In [60]:
#This data is in the required form, so it can be saved to the 'monthly datasets' folder under precombined_economic in maryland_economics_database
monthly_inflation_rate.to_csv('/Users/ben_nicholson/Visual_Code_Projects/Personal_Projects/Maryland Car Sales Data/maryland_economics_database/precombined_economic/monthly datasets/monthly_inflation_rate.csv')

Now that all of the data has been converted into a similar form, pandas' concat function can be used to create a new dataframe. It uses the 'Date' column, set as the index, to join the separate dataframes into one. The code below first creates the completed yearly economic dataset file, which can then be loaded into the relational database; the monthly files are combined in the same way afterwards.
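
To illustrate why the date index matters, here is a minimal sketch with two made-up frames whose rows are deliberately in a different order. With `axis=1`, `pd.concat` places the frames side by side and matches rows by index label rather than by position. The frame names and values are hypothetical.

```python
import pandas as pd

# Two hypothetical frames indexed by date, deliberately in different row order
dates = pd.to_datetime(['2002-01-01', '2003-01-01', '2004-01-01'])
gdp = pd.DataFrame({'GDP': [1.0, 2.0, 3.0]}, index=dates)
population = pd.DataFrame({'Population': [30, 10, 20]},
                          index=dates[[2, 0, 1]])

# axis=1 places the frames side by side and aligns rows on the index labels
combined = pd.concat([gdp, population], axis=1)
print(combined)
```

In the result, the 2002 row pairs GDP 1.0 with population 10, even though 10 was the second row of `population`. With the default outer join, a date that is missing from one frame appears as `NaN` in that frame's columns.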

In [61]:
#The os package is used for working with files and directories in Python
import os

# Directory where your files are located
directory_yearly = '/Users/ben_nicholson/Visual_Code_Projects/Personal_Projects/Maryland Car Sales Data/maryland_economics_database/precombined_economic/yearly datasets'

# List all files in the directory
files_yearly = os.listdir(directory_yearly)

#Initialize an empty DataFrame to store the joined data
yearly_economic_dataset = pd.DataFrame()

# Iterate through each file
for file_name in files_yearly:
    file_path = os.path.join(directory_yearly, file_name)

    #Read the CSV file
    data_yearly = pd.read_csv(file_path)
    
    #Convert 'Date' column to datetime
    data_yearly['Date'] = pd.to_datetime(data_yearly['Date'])
    
    #Set 'Date' column as index
    data_yearly.set_index('Date', inplace=True)
    data_yearly.drop(columns='Unnamed: 0',inplace=True)
    
    #Join the data based on date
    yearly_economic_dataset = pd.concat([yearly_economic_dataset, data_yearly], axis=1)

    print(f"Data joined for file: {file_name}")

Data joined for file: yearly_inflation_rate_usa.csv
Data joined for file: total_wages(thousands).csv
Data joined for file: gdp_state_maryland.csv
Data joined for file: average_annual_pay.csv
Data joined for file: average_weekly_wage.csv
Data joined for file: total_workforce.csv
Data joined for file: maryland_population.csv


In [62]:
#Move 'Date' from the index back to a column and observe the dataset that combines the yearly economic information
yearly_economic_dataset.reset_index(inplace=True)
yearly_economic_dataset['Date'] = pd.to_datetime(yearly_economic_dataset['Date'])
yearly_economic_dataset

Unnamed: 0,Date,Inflation Rate,Total Wages in Maryland,GDP for Maryland State,Annual Average Pay,Average Weekly Wage,Maryland Total Workforce,Population
0,2002-01-01,2.4,95589510000.0,217837100000.0,39382.0,757.0,2427396.0,5440389
1,2003-01-01,1.4,99039960000.0,228959300000.0,40686.0,782.0,2434480.0,5496269
2,2004-01-01,1.8,104716500000.0,245124500000.0,42579.0,819.0,2461074.0,5546935
3,2005-01-01,2.2,110807900000.0,262100000000.0,44368.0,853.0,2497416.0,5592379
4,2006-01-01,2.5,116791300000.0,274145300000.0,46162.0,888.0,2530129.0,5627367
5,2007-01-01,2.3,122886700000.0,282965600000.0,48241.0,928.0,2546850.0,5653408
6,2008-01-01,2.3,125706600000.0,295400200000.0,49535.0,953.0,2537400.0,5684965
7,2009-01-01,1.7,124479900000.0,299102700000.0,50579.0,973.0,2460972.0,5730388
8,2010-01-01,1.0,126926300000.0,314728300000.0,51739.0,995.0,2454418.0,5788784
9,2011-01-01,1.7,131379800000.0,326281200000.0,53008.0,1019.0,2479122.0,5840241


In [63]:
yearly_economic_dataset['Date'].dtype

dtype('<M8[ns]')

In [64]:
yearly_economic_dataset.columns

Index(['Date', 'Inflation Rate', 'Total Wages in Maryland',
       'GDP for Maryland State', 'Annual Average Pay', 'Average Weekly Wage',
       'Maryland Total Workforce', 'Population'],
      dtype='object')

In [65]:
yearly_economic_dataset['Population']

0     5440389
1     5496269
2     5546935
3     5592379
4     5627367
5     5653408
6     5684965
7     5730388
8     5788784
9     5840241
10    5888375
11    5925197
12    5960064
13    5988528
14    6007014
15    6028186
16    6042153
17    6054954
18    6173689
19    6175045
20    6163981
21    6180253
Name: Population, dtype: int64

In [66]:
#Save the dataframe that has been joined on the date column as a new csv file
#The file is saved to the 'combined_economic' folder
yearly_economic_dataset.to_csv('/Users/ben_nicholson/Visual_Code_Projects/Personal_Projects/Maryland Car Sales Data/maryland_economics_database/combined_economic/yearly_economic_dataset.csv')

In [67]:
# Directory where your files are located
directory_monthly = '/Users/ben_nicholson/Visual_Code_Projects/Personal_Projects/Maryland Car Sales Data/maryland_economics_database/precombined_economic/monthly datasets'

#List all files in monthly datasets
files_monthly = os.listdir(directory_monthly)

#Create an empty dataframe that is going to add desired values
monthly_economic_dataset = pd.DataFrame()

# Iterate through each file
for file_name in files_monthly:
    file_path = os.path.join(directory_monthly, file_name)

    #Read the CSV file
    data_monthly = pd.read_csv(file_path)
    
    #Convert 'Date' column to datetime
    data_monthly['Date'] = pd.to_datetime(data_monthly['Date'])
    
    #Set 'Date' column as index
    data_monthly.set_index('Date', inplace=True)
    data_monthly.drop(columns='Unnamed: 0',inplace=True)
    
    #Join the data based on date
    monthly_economic_dataset = pd.concat([monthly_economic_dataset, data_monthly], axis=1)

    print(f"Data joined for file: {file_name}")

Data joined for file: unemployment_not_seasonally_adjusted.csv
Data joined for file: monthly_inflation_rate.csv
Data joined for file: unemployment_seasonally_adjusted.csv
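
The yearly and monthly loops above follow the same steps, so they could be folded into one helper in the same spirit as `convert_columns_to_rows`. The following is a sketch under stated assumptions, not the notebook's code: `combine_csvs_on_date` is a hypothetical name, it assumes every CSV has a `Date` column, and it skips non-CSV entries such as hidden system files. The demonstration writes two small made-up files to a temporary directory.

```python
import os
import tempfile
import pandas as pd

def combine_csvs_on_date(directory):
    """Hypothetical helper: join every CSV in `directory` side by side on 'Date'."""
    frames = []
    # Sort the names for a stable column order and skip non-CSV entries
    for file_name in sorted(os.listdir(directory)):
        if not file_name.endswith('.csv'):
            continue
        frame = pd.read_csv(os.path.join(directory, file_name), parse_dates=['Date'])
        frame = frame.set_index('Date')
        # Drop the saved row index if the file contains one
        frame = frame.drop(columns='Unnamed: 0', errors='ignore')
        frames.append(frame)
    return pd.concat(frames, axis=1)

# Demonstration on two small temporary files with made-up values
with tempfile.TemporaryDirectory() as tmp:
    dates = pd.to_datetime(['2002-01-01', '2002-02-01'])
    pd.DataFrame({'a': [1.0, 2.0], 'Date': dates}).to_csv(os.path.join(tmp, 'a.csv'))
    pd.DataFrame({'b': [3.0, 4.0], 'Date': dates}).to_csv(os.path.join(tmp, 'b.csv'))
    combined = combine_csvs_on_date(tmp)

print(list(combined.columns))  # ['a', 'b']
```

Collecting the frames in a list and concatenating once also avoids growing a dataframe inside the loop.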


In [68]:
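#Observe the new dataframe to see if it has joined the values on the date
#Check that no date appears twice; drop_duplicates() is not used because it compares column values only and would remove valid months that repeat the same values
assert monthly_economic_dataset.index.is_unique
monthly_economic_dataset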

Unnamed: 0_level_0,Unemployment rate (not sa),Monthly Inflation Rate,Unemployment rate
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2002-01-01,5.5,2.6,5.0
2002-02-01,5.2,2.6,5.0
2002-03-01,5.1,2.4,5.0
2002-04-01,4.9,2.5,4.9
2002-05-01,4.7,2.5,4.9
...,...,...,...
2023-08-01,1.8,4.3,1.7
2023-09-01,1.8,4.1,1.6
2023-10-01,2.1,4.0,1.7
2023-11-01,1.8,4.0,1.8


In [69]:
#Save the dataframe that has been joined on the date column as a new csv file
#The file is saved to the 'combined_economic' folder
monthly_economic_dataset.to_csv('/Users/ben_nicholson/Visual_Code_Projects/Personal_Projects/Maryland Car Sales Data/maryland_economics_database/combined_economic/monthly_economic_dataset.csv')