In [2]:
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import pickle

### Import functions unique to this project

In [3]:
from master_functions import get_car_urls
from master_functions import make_model_df

## GMC - Create master dataframe for all models

This worksheet builds a dataframe for all **GMC** vehicles. The source data on https://www.fueleconomy.gov/ has some structural 'quirks', so the steps below include manual checks: each model's data is inspected for issues, the 'normal' models are combined first, and the models that required special attention are added afterwards.

**Step #1:** Create unique urls for every car model for the years 1984 - 2021<br>
- Uses `get_car_urls` from `master_functions`. Inputs: (car_make, [list of all models])

In [5]:
gmc_urls = get_car_urls('GMC',
                        ['Acadia','C2500 Sierra','Caballero Pickup',
                         'Canyon','Envoy','EV1','Jimmy',
                         'K2500 Sierra','Pickup','Rally',
                         'S15 Cab Chassis','S15 Pickup',
                         'S15 Utility Body','Safari','Safari Cargo',
                         'Safari Cargo','Safari Passenger',
                         'Sierra','Sonoma','Suburban',
                         'Terrain','Vandura','Yukon'
                        ])
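Because the model list is typed by hand, a repeated name is easy to miss; the list above contains 'Safari Cargo' twice. A minimal sketch of a duplicate check follows. `find_duplicates` is a hypothetical helper, not part of `master_functions`, and the short `models` list is only sample input.

```python
from collections import Counter

def find_duplicates(names):
    """Return the names that appear more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name, n in counts.items() if n > 1]

# Sample input only; in the notebook this would be the full model list
models = ['Safari', 'Safari Cargo', 'Safari Cargo', 'Safari Passenger']
print(find_duplicates(models))
```

Running the check before calling `get_car_urls` keeps the url count in Step 2 in line with the number of distinct models.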

-

**Step #2:** Get the length of the list created in Step 1. This is the number of urls to check, one at a time, in Step 3.<br>

In [6]:
# Verify the number of urls; this is how many
# urls need to be checked below

len(gmc_urls)

23

-

**Step #3:** Check all of the urls you just created.<br>
- If a url does not work, add it to the 'problem' URLs string below this cell

In [30]:
# Test area for each url with [carmake]_urls[index]
# by seeing if data appears correctly

make_model_df('GMC',gmc_urls[22])

Unnamed: 0,year,make,model,capacity_liters,cylinders,transmission,trans_speed,fuel_type,gg_emissions,mpg
0,2010,GMC,Yukon 1500 Hybrid 2WD,6.0,8,Automatic,variable,Regular Gasoline,404,22
1,2009,GMC,Yukon 1500 Hybrid 2WD,6.0,8,Automatic,variable,Regular Gasoline,423,21
2,2008,GMC,Yukon 1500 Hybrid 2WD,6.0,8,Automatic,variable,Regular Gasoline,423,21
3,2010,GMC,Yukon 1500 Hybrid 4WD,6.0,8,Automatic,variable,Regular Gasoline,423,21
4,2013,GMC,Yukon 1500 Hybrid 2WD,6.0,8,Automatic,variable,Regular Gasoline,418,21
5,2012,GMC,Yukon 1500 Hybrid 2WD,6.0,8,Automatic,variable,Regular Gasoline,423,21
6,2011,GMC,Yukon 1500 Hybrid 2WD,6.0,8,Automatic,variable,Regular Gasoline,423,21
7,2013,GMC,Yukon 1500 Hybrid 4WD,6.0,8,Automatic,variable,Regular Gasoline,416,21
8,2012,GMC,Yukon 1500 Hybrid 4WD,6.0,8,Automatic,variable,Regular Gasoline,423,21
9,2011,GMC,Yukon 1500 Hybrid 4WD,6.0,8,Automatic,variable,Regular Gasoline,423,21
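The one-at-a-time check above could also be scripted as a first pass. The sketch below assumes a 'problem' url either raises an exception or yields no rows; that is an assumption about how failures show up, not documented behaviour of `make_model_df`. `find_problem_indexes` and `fake_builder` are hypothetical names used for illustration.

```python
def find_problem_indexes(urls, build_df):
    """Call build_df on every url; return indexes that raised or returned no rows."""
    problems = []
    for i, url in enumerate(urls):
        try:
            df = build_df(url)
        except Exception as exc:
            print(f"[{i}] failed: {exc!r}")
            problems.append(i)
            continue
        if len(df) == 0:
            print(f"[{i}] returned no rows")
            problems.append(i)
    return problems

# Stand-in builder so the sketch runs without network access.
# In the notebook one might instead pass: lambda u: make_model_df('GMC', u)
def fake_builder(url):
    if "bad" in url:
        raise ValueError("no table found")
    return [] if "empty" in url else [{"model": url}]

print(find_problem_indexes(["ok-1", "bad-2", "empty-3", "ok-4"], fake_builder))
```

A dataframe that builds without error can still contain wrong data, so a visual check of each result remains useful.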


-

Populate this section with any 'problem' URLs found in the test above

In [31]:
#'Problem' URLs
'''
gmc_urls[5]

'''

# Print list length again to
# set the length of the range in the next cell
len(gmc_urls)

23

-

**Step #4:** Create dfs for all 'okay' urls and place each into a master list
- Automate where possible, but some may need to be added one by one to avoid 'problem' urls

In [32]:
# Automate it for 'normal' urls: make a df from each and add it to the master df list

gmc_dfs = []

# urls before the 'problem' url at index 5
for x in range(0,5):
    gmc_dfs.append(make_model_df('GMC',gmc_urls[x]))

# urls after index 5, through the last index (22)
for x in range(6,23):
    gmc_dfs.append(make_model_df('GMC',gmc_urls[x]))
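The two hard-coded ranges above skip index 5. With more than one 'problem' url, the list of indexes to process could instead be derived from the problem list. This is a minimal sketch; `okay_indexes` is a hypothetical helper, and the values 23 and [5] are taken from the cells above.

```python
def okay_indexes(total, problem_indexes):
    """Return every index in range(total) that is not listed as a problem."""
    skip = set(problem_indexes)
    return [i for i in range(total) if i not in skip]

indexes = okay_indexes(23, [5])
print(indexes)

# In the notebook, the loop could then become:
# for x in indexes:
#     gmc_dfs.append(make_model_df('GMC', gmc_urls[x]))
```

The 'problem' indexes are then recorded in one place, and the loop does not need to be split again when another one is found.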


-

**Step #5:** Concatenate all of the 'normal' car model dfs into one master dataframe

In [33]:
gmc_dfs = pd.concat(gmc_dfs, ignore_index=True)

gmc_dfs

Unnamed: 0,year,make,model,capacity_liters,cylinders,transmission,trans_speed,fuel_type,gg_emissions,mpg
0,2020,GMC,Acadia FWD,2.0,4,Automatic,9,Regular Gasoline,362,24
1,2020,GMC,Acadia AWD,2.0,4,Automatic,9,Regular Gasoline,374,23
2,2020,GMC,Acadia FWD,2.5,4,Automatic,9,Regular Gasoline,382,23
3,2019,GMC,Acadia FWD,2.5,4,Automatic,6,Regular Gasoline,383,23
4,2018,GMC,Acadia FWD,2.5,4,Automatic,6,Regular Gasoline,386,23
...,...,...,...,...,...,...,...,...,...,...
207,2012,GMC,Yukon 1500 Hybrid 2WD,6.0,8,Automatic,variable,Regular Gasoline,423,21
208,2011,GMC,Yukon 1500 Hybrid 2WD,6.0,8,Automatic,variable,Regular Gasoline,423,21
209,2013,GMC,Yukon 1500 Hybrid 4WD,6.0,8,Automatic,variable,Regular Gasoline,416,21
210,2012,GMC,Yukon 1500 Hybrid 4WD,6.0,8,Automatic,variable,Regular Gasoline,423,21


-

**Step #6:** Pickle the master dataframe of 'normal' car models made in Step 5
- Saving it means further work on the dataframe can start from this point

In [34]:
with open('pickles/gmc_dfs.pickle', 'wb') as to_write:
    pickle.dump(gmc_dfs, to_write)
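The `open(..., 'wb')` call above raises `FileNotFoundError` if the `pickles/` directory does not exist yet. A minimal sketch of a save/load pair that creates the directory first is shown below. `save_pickle` and `load_pickle` are hypothetical helpers, and a small dict stands in for the dataframe so the example runs on its own.

```python
import pickle
import tempfile
from pathlib import Path

def save_pickle(obj, path):
    """Pickle obj to path, creating parent directories if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open('wb') as to_write:
        pickle.dump(obj, to_write)
    return path

def load_pickle(path):
    """Load a pickled object; only use on files you created or trust."""
    with Path(path).open('rb') as read_file:
        return pickle.load(read_file)

# Round trip in a temporary directory; the dict is a stand-in for gmc_dfs
with tempfile.TemporaryDirectory() as tmp:
    target = save_pickle({'make': 'GMC', 'models': ['Acadia', 'Yukon']},
                         Path(tmp) / 'pickles' / 'demo.pickle')
    restored = load_pickle(target)

print(restored)
```

In the notebook the same helpers could be called with `'pickles/gmc_dfs.pickle'` and the concatenated dataframe.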

-

**Step #7:** Un-pickle the master dataframe of 'normal' car models saved in Step 6

In [35]:
with open('pickles/gmc_dfs.pickle','rb') as read_file:
    gmc_dfs = pickle.load(read_file)