### Step 1: Import needed libraries

One reason Python is so popular is its rich, comprehensive set of _third-party libraries_, each focused on specific functionality that developers need.

Some common libraries - and the associated purpose for each - are listed below: 
- [`pandas`](https://pandas.pydata.org/pandas-docs/stable/): Provides common data analysis tools and data structures - especially [`DataFrame`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)s.
- [`requests`](https://requests.readthedocs.io/en/master/): Whenever you need to make an API request to GET or POST data programmatically from a source like [Colorado Information Marketplace](https://data.colorado.gov/) or the [NREL API](https://developer.nrel.gov/), `requests` allows you to easily make these API calls.
- [`numpy`](https://numpy.org/): Library focused on scientific computing, particularly fast operations on arrays and matrices.
- [`scipy`](https://www.scipy.org/): Like `numpy`, this library focuses on scientific computing, but `scipy` is geared toward more complex mathematical/scientific operations.
- [`matplotlib`](https://matplotlib.org/): Plotting library focused on data visualization

These are just a few of the 200,000+ libraries available within the [Python ecosystem](https://pypi.org/).

Below is the syntax to import third-party libraries. Note that these imports will not work unless you have first installed the libraries listed in the project's `Pipfile` by running the command `pipenv install` from your terminal. (`os` is part of Python's standard library, so it needs no installation.)

In [None]:
import pandas as pd
import git
import os
import numpy as np
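
If one of these imports fails with `ModuleNotFoundError`, the library most likely isn't installed in your environment yet. As a minimal sketch (the helper name `missing_packages` is made up for illustration), the standard library's `importlib.util.find_spec` can report which top-level packages Python can't locate:

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# `os` and `json` ship with Python, so they are always found
stdlib_missing = missing_packages(["os", "json"])

# the third-party libraries imported in this tutorial;
# an empty list means all of them are installed
tutorial_missing = missing_packages(["pandas", "git", "numpy"])
print("not installed: {}".format(tutorial_missing))
```

Anything that shows up in the printed list needs to be installed before the import cell will run.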

### Step 2: Find appropriate path that includes our data set

To structure this project, I decided to create a folder called `data` which includes all the CSVs needed for this tutorial.

We'll be working with the file `Alternative_Energy_Laws_and_Incentives_in_Colorado_2014.csv`, downloaded from the Colorado Information Marketplace (CIM) [here](https://data.colorado.gov/Energy/Alternative-Energy-Laws-and-Incentives-in-Colorado/nxw4-ev8w). According to this site, this CSV includes data on:

> _Law titles, text and dates for biofuels, natural gas, plug in electric and more categories from National Renewable Energy Laboratory (NREL) since 2007 and updated annually after each state’s legislative session ends_

The library [`GitPython`](https://gitpython.readthedocs.io/en/stable/) - imported above with the syntax `import git` and used below through the code snippet `git.Repo()` - is a library used to interact with Git repositories.

In [None]:
DATA_DIR_NAME = 'data'
GIT_ROOT_PATH = git.Repo(os.getcwd(), search_parent_directories=True).working_dir
data_dir_full_path = os.path.join(GIT_ROOT_PATH, DATA_DIR_NAME)

print("The data we'll be working with is in this directory:\n{}".format(
    data_dir_full_path
))
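
To get a feel for what `search_parent_directories=True` is doing, here is a rough standard-library approximation: start in a directory and walk upward until a folder containing `.git` is found. This is only a sketch of the idea, not GitPython's actual implementation, and the function name `find_repo_root` is an assumption made for this example.

```python
import tempfile
from pathlib import Path


def find_repo_root(start):
    """Walk upward from `start` and return the first directory containing `.git`."""
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / ".git").exists():
            return candidate
    return None  # no repository found above `start`


# demo on a throwaway directory tree: <tmp>/.git and <tmp>/notebooks/drafts
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / ".git").mkdir()
    nested = Path(tmp) / "notebooks" / "drafts"
    nested.mkdir(parents=True)

    found = find_repo_root(nested)
    expected = Path(tmp).resolve()

print(found == expected)
```

Anchoring paths to the repository root this way means the notebook can find the `data` folder no matter which subfolder it is launched from.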

In [None]:
# find all files in data_dir_full_path
files = os.listdir(data_dir_full_path)

# list comprehension: an extremely common and useful way to build a list by iterating through another list
files_csv = [i for i in files if i.endswith('.csv')]
# the above line goes through each file in the data directory, creating a list of filenames ending in .csv

# join the data directory with the name of the CSV we want, to get its full path
co_energy_laws_csv_path = os.path.join(data_dir_full_path, 'Alternative_Energy_Laws_and_Incentives_in_Colorado_2014.csv')
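
If list comprehensions are new to you, it can help to see one next to the equivalent `for` loop. The filenames below are made up for illustration:

```python
# made-up filenames, standing in for the output of os.listdir()
files = ["laws.csv", "README.md", "notes.txt", "incentives.csv"]

# the long way: start with an empty list and append matching items
files_csv_loop = []
for name in files:
    if name.endswith(".csv"):
        files_csv_loop.append(name)

# the list comprehension: same result in one line
files_csv = [name for name in files if name.endswith(".csv")]

print(files_csv)
```

Both approaches produce the same list; the comprehension is simply more compact.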

In [None]:
# read in the csv from the path defined above. 
# `lawid` is the unique identifier of the data set we're working with, so we can set it as index_col
co_energy_laws_df = pd.read_csv(co_energy_laws_csv_path, index_col = 'lawid')

# show the first few rows in the dataset
co_energy_laws_df.head(n=3)
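
To see what `index_col` changes, here is a small sketch that reads a tiny in-memory CSV. The rows are invented for illustration; only the `lawid` and `title` column names come from the real data set.

```python
import io

import pandas as pd

# a tiny, invented CSV held in memory instead of on disk
toy_csv = io.StringIO(
    "lawid,title,type\n"
    "101,Example Law A,type_a\n"
    "102,Example Law B,type_b\n"
)

toy_df = pd.read_csv(toy_csv, index_col="lawid")

# `lawid` is now the row label rather than a regular column,
# so rows can be looked up directly by their law ID
first_title = toy_df.loc[101, "title"]
print(list(toy_df.columns))
print(first_title)
```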

In [None]:
print("columns in dataset:\n{}".format(list(co_energy_laws_df.columns)))

print("\nnum rows in dataset: {}\n".format(co_energy_laws_df.shape[0]))

print("Unique types of legislation: {}\n".format(
    co_energy_laws_df.type.unique()) # alternate syntax, same result: co_energy_laws_df['type'].unique() 
)

# iterate through the first 5 rows of the data set, and access the values of each row
for idx, law in co_energy_laws_df[0:5].iterrows():
    print("----------------")
    print("Legislation Name (Law ID {}): {}".format(idx, law['title']))
    print("----------------")
    print(law['text'])
    print("\n")
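
As a smaller illustration of what `iterrows()` hands back, the sketch below loops over an invented three-row DataFrame. Each iteration yields a pair: the row's index label and the row itself as a pandas `Series`. It uses `.iloc[0:2]`, which makes the "by position" slicing explicit.

```python
import pandas as pd

# invented rows for illustration
toy_df = pd.DataFrame(
    {"title": ["Example Law A", "Example Law B", "Example Law C"]},
    index=pd.Index([101, 102, 103], name="lawid"),
)

seen = []
# .iloc[0:2] selects the first two rows by position
for idx, law in toy_df.iloc[0:2].iterrows():
    # idx is the index label (the law ID); law behaves like a dictionary of column values
    seen.append((idx, law["title"]))

print(seen)
```

`iterrows()` is convenient for printing or inspecting a handful of rows; for transforming whole columns, column-wise operations are usually faster.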

In [None]:
# value_counts() counts the number of rows for each distinct value in the column
co_energy_laws_df['type'].value_counts()
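
Here is a toy version of the same call on invented values, along with the optional `normalize=True` argument, which returns shares instead of raw counts:

```python
import pandas as pd

# invented values for illustration
types = pd.Series(["x", "y", "x", "x"])

# number of rows per distinct value, most frequent first
counts = types.value_counts()

# same idea, expressed as a fraction of all rows
shares = types.value_counts(normalize=True)

print(counts)
print(shares)
```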

In [None]:
# notice that this is a pipe-delimited column
co_energy_laws_df['technologycategories'].value_counts()

# the format of this column isn't conducive to analysis
# because there are multiple pieces of information in each value.

# For example, the rows with `ELEC|HEV` represent legislation
# encouraging two technologies - electric vehicles *and* hybrids.

# We'll need to manipulate/transform this column into
# something more useful

In [None]:
# get all values of the technologycategories column
tech_cats_array = co_energy_laws_df['technologycategories'].values

# concat all of these values together with | 
tech_cats_concat = "|".join(tech_cats_array)

print("all values concatenated together:\n{}\n".format(tech_cats_concat))

unique_tech_cats = list(set(tech_cats_concat.split("|")))
print("unique technologies:\n{}".format(unique_tech_cats))
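
The join-then-split trick is easier to follow on a few hand-written values. This sketch uses plain Python lists, so no pandas is needed. Because a `set` has no guaranteed order, wrapping it in `sorted()` gives the same result on every run.

```python
# hand-written values standing in for the column
tech_cats = ["ELEC|HEV", "BIOD", "ELEC"]

# glue every value together into one long pipe-delimited string
concatenated = "|".join(tech_cats)

# split it back apart, then use a set to drop the duplicates
unique_sorted = sorted(set(concatenated.split("|")))

print(concatenated)
print(unique_sorted)
```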

In [None]:
# next, we'll create one column per technology.
# this column will be a "boolean" (values are only True or False)

for tech_cat in unique_tech_cats:
    # split on "|" first so we match whole category codes, not substrings
    co_energy_laws_df["tech_{}_flg".format(tech_cat)] = \
        co_energy_laws_df['technologycategories'].apply(
            lambda cats: tech_cat in cats.split("|")
        )
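
A subtlety worth knowing when testing membership in a pipe-delimited string: Python's `in` operator on a string is a *substring* check, so a short code can match inside a longer one. The codes below (`AB`, `ABC`, `XY`) are hypothetical and chosen only to show the difference:

```python
# a hypothetical pipe-delimited cell value
cell = "ABC|XY"

# substring check: "AB" appears inside "ABC", so this is True
substring_match = "AB" in cell

# whole-code check: split first, then look for an exact match in the list
whole_match = "AB" in cell.split("|")

print(substring_match, whole_match)
```

Splitting before checking is the safer habit whenever one category code could be contained in another.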


In [None]:
co_energy_laws_df.columns

In [None]:
# to check the transformation we ran on the data set (adding columns),
# let's look at a few of the columns we created
co_energy_laws_df[['technologycategories', 'tech_ELEC_flg', 'tech_BIOD_flg']].head()
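
pandas also has a built-in for this kind of column, `Series.str.get_dummies`, which splits on a separator and returns one indicator column per unique value. The sketch below runs it on made-up rows and mirrors the `tech_..._flg` naming used in this tutorial; it is an alternative to the loop, not what the cells above use.

```python
import pandas as pd

# made-up rows for illustration
toy = pd.DataFrame({"technologycategories": ["ELEC|HEV", "BIOD", "ELEC"]})

flags = (
    toy["technologycategories"]
    .str.get_dummies(sep="|")  # one indicator column per unique code
    .astype(bool)              # make sure the indicators are True/False
    .add_prefix("tech_")
    .add_suffix("_flg")
)

# attach the new columns to the original rows
toy_with_flags = toy.join(flags)
print(toy_with_flags)
```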

In [None]:
# because there are multiple columns that have pipe-delimited values,
# we can generalize the code created for `technologycategories` above 
# to create a function that we can call for the other columns

def create_boolean_cols(df, col_name, new_col_name_prepend, sep="|"):
    """Add one boolean `<prepend>_<category>_flg` column to df per unique value in a delimited column."""

    # split each value into a list of categories; missing values (NaN) become an empty list
    cats_per_row = df[col_name].apply(
        lambda val: val.split(sep) if isinstance(val, str) else []
    )
    unique_cats = sorted({cat for cats in cats_per_row for cat in cats})

    print("\n-------------- start column name: {}".format(col_name))
    print("unique values:\n{}".format(unique_cats))

    col_prepend = "{}_".format(new_col_name_prepend)

    print("building out new boolean columns...")
    for cat in unique_cats:
        # checking the split list matches whole categories only, never substrings
        df["{}{}_flg".format(col_prepend, cat)] = cats_per_row.apply(
            lambda cats: cat in cats
        )

    print("-------------- end column name: {}".format(col_name))
    # note: df is modified in place; it is also returned for convenience
    return df

In [None]:
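# create_boolean_cols adds the new columns to co_energy_laws_df itself and
# returns that same DataFrame, so each call below builds on the previous one
augmented_co_energy_laws_df = create_boolean_cols(
    co_energy_laws_df, 
    "incentivecategories", #col_name
    "incentive"
)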

augmented_co_energy_laws_df = create_boolean_cols(
    co_energy_laws_df, 
    "regulationcategories", #col_name
    "reg"
)

augmented_co_energy_laws_df = create_boolean_cols(
    co_energy_laws_df, 
    "usercategories", #col_name
    "user"
)

In [None]:
augmented_co_energy_laws_df.columns

In [None]:
augmented_co_energy_laws_df[["usercategories", "user_FLEET_flg", "user_AFS_flg"]]
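
Once the flag columns exist, they make counting and filtering straightforward. The sketch below uses an invented three-row DataFrame; only the two flag column names come from this tutorial. Summing a boolean column counts its `True` values, and a boolean column can be used directly as a row filter.

```python
import pandas as pd

# invented rows for illustration
toy = pd.DataFrame({
    "title": ["a", "b", "c"],
    "user_FLEET_flg": [True, False, True],
    "user_AFS_flg": [False, False, True],
})

# pick out every flag column by its suffix
flag_cols = [col for col in toy.columns if col.endswith("_flg")]

# True counts as 1 and False as 0, so sum() gives the number of laws per category
counts = toy[flag_cols].sum().sort_values(ascending=False)

# keep only the rows where the FLEET flag is True
fleet_laws = toy[toy["user_FLEET_flg"]]

print(counts)
print(fleet_laws["title"].tolist())
```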