# Project 1 - Data Analysis in Python and Pandas
## Importing Building Data from NYC Open Data
 - The data is downloaded as a CSV file from [NYC Open Data Building Energy](https://data.cityofnewyork.us/Environment/NYC-Building-Energy-and-Water-Data-Disclosure-for-/5zyy-y8am/about_data), saved to the local directory, and renamed *nyc_building_energy.csv*.
 - The file is read with csv.DictReader, which turns each row into a dictionary keyed by column name. The keys of the first row are then printed to show the label of each column in the file.

In [2]:
import csv
with open('nyc_building_energy.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row.keys())
        break

dict_keys(['Calendar Year', 'Property ID', 'Property Name', 'Parent Property ID', 'Parent Property Name', 'Year Ending', 'NYC Borough, Block and Lot (BBL)', 'NYC Building Identification Number (BIN)', 'Address 1', 'City', 'Postal Code', 'Primary Property Type - Self Selected', 'Primary Property Type - Portfolio Manager-Calculated', 'National Median Reference Property Type', 'List of All Property Use Types (GFA) (ft²)', 'Largest Property Use Type', 'Largest Property Use Type - Gross Floor Area (ft²)', '2nd Largest Property Use Type', '2nd Largest Property Use Type - Gross Floor Area (ft²)', '3rd Largest Property Use Type', '3rd Largest Property Use Type - Gross Floor Area (ft²)', 'Year Built', 'Construction Status', 'Number of Buildings', 'Occupancy', 'Metered Areas (Energy)', 'Metered Areas (Water)', 'ENERGY STAR Score', 'National Median ENERGY STAR Score', 'Target ENERGY STAR Score', 'Reason(s) for No Score', 'ENERGY STAR Certification - Year(s) Certified (Score)', 'Eligible for Certi
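
As a possible alternative to printing the keys of the first row, csv.DictReader also exposes the header through its fieldnames attribute. The sketch below runs on a small in-memory sample (hypothetical values, not the real file), so it works without the download.

```python
import csv
import io

# Hypothetical three-column sample standing in for nyc_building_energy.csv.
sample_csv = io.StringIO(
    "Calendar Year,Property ID,City\n"
    "2022,1,Queens\n"
)

reader = csv.DictReader(sample_csv)
# fieldnames reads the header row without consuming any data rows.
column_names = reader.fieldnames
print(column_names)
```

Because the header is read on its own, the loop over the data rows can start afterwards without needing a break statement.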

## Using Python to Understand the Data:
### Making the Data Readable
 - Again using the DictReader method, the data is added to a list called *data*. The first row of data is printed.

In [4]:
data = []
with open('nyc_building_energy.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        data.append(row)

data[:1]

[{'Calendar Year': '2022',
  'Property ID': '6414946',
  'Property Name': '58-01 Grand Avenue',
  'Parent Property ID': 'Not Applicable: Standalone Property',
  'Parent Property Name': 'Not Applicable: Standalone Property',
  'Year Ending': '12/31/2022',
  'NYC Borough, Block and Lot (BBL)': '4026780001',
  'NYC Building Identification Number (BIN)': '4059918',
  'Address 1': '58-01 Grand Avenue',
  'City': 'Queens',
  'Postal Code': '11378',
  'Primary Property Type - Self Selected': 'Non-Refrigerated Warehouse',
  'Primary Property Type - Portfolio Manager-Calculated': 'Non-Refrigerated Warehouse',
  'National Median Reference Property Type': 'CBECS - Non-Refrigerated Warehouse',
  'List of All Property Use Types (GFA) (ft²)': 'Non-Refrigerated Warehouse (51749.0)',
  'Largest Property Use Type': 'Non-Refrigerated Warehouse',
  'Largest Property Use Type - Gross Floor Area (ft²)': '51749',
  '2nd Largest Property Use Type': 'Not Available',
  '2nd Largest Property Use Type - Gross Fl

- Once the data is in a list, a new list is created that keeps only the Total (Location-Based) GHG Emissions values. Each value is converted from text to a float, and rows marked 'Not Available' are skipped because they hold no numeric data.

In [6]:
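# Name of the emissions column, stored once to keep the comprehension readable
ghg_column = 'Total (Location-Based) GHG Emissions (Metric Tons CO2e)'
# Keep only numeric entries and convert them from text to float
ghg_emissions = [float(row[ghg_column]) for row in data if row[ghg_column] != 'Not Available']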
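
The filter above assumes 'Not Available' is the only non-numeric marker in the column. A more defensive sketch, using a hypothetical helper *to_float_or_none* and made-up sample rows, skips anything that float() cannot parse:

```python
GHG_COLUMN = 'Total (Location-Based) GHG Emissions (Metric Tons CO2e)'


def to_float_or_none(text):
    """Return text as a float, or None when it is not numeric (hypothetical helper)."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


# Hypothetical rows shaped like the DictReader output above.
sample_rows = [
    {GHG_COLUMN: '292.2'},
    {GHG_COLUMN: 'Not Available'},
    {GHG_COLUMN: ''},
    {GHG_COLUMN: '0.0'},
]

parsed = [to_float_or_none(row[GHG_COLUMN]) for row in sample_rows]
# Drop the entries that could not be converted.
sample_emissions = [value for value in parsed if value is not None]
print(sample_emissions)
```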

### Finding the Mean Greenhouse Gas Emissions Using Basic Python
- Using the list *ghg_emissions* and the sum function, the total greenhouse gas emissions are calculated.
- Next, the mean is found by dividing that total by the number of values in *ghg_emissions*.

In [8]:
total_ghg = sum(ghg_emissions)
mean_ghg_emissions = total_ghg/len(ghg_emissions)
print('The mean greenhouse gas emissions by buildings in NYC is', mean_ghg_emissions, 'metric tons of CO2e')

The mean greenhouse gas emissions by buildings in NYC is 914.1075328485776 metric tons of CO2e
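
One way to cross-check a hand-computed mean is the standard library's statistics module. The sketch below uses a small hypothetical list rather than the real data:

```python
import statistics

# Hypothetical sample values in metric tons of CO2e.
sample_emissions = [0.0, 100.0, 200.0, 700.0]

# Manual mean: total divided by the number of values.
manual_mean = sum(sample_emissions) / len(sample_emissions)
# Library mean for comparison.
library_mean = statistics.fmean(sample_emissions)

print(manual_mean, library_mean)
```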


### Finding the Median Greenhouse Gas Emissions Using Basic Python
- First, the list *ghg_emissions* is sorted using the .sort method.
- Next, the length of the list is calculated to be able to index the middle of the list.
- Because the length of the list is even, the median is calculated using the mean of the middle two values of the list.

In [10]:
ghg_emissions.sort()
length = len(ghg_emissions)
length

62712

In [11]:
ghg_emissions[(length//2)]

292.2

In [12]:
median = (ghg_emissions[(length//2)] + ghg_emissions[((length//2)-1)])/2
print('The median amount of GHG emissions for buildings in NYC is', median,'metric tons of CO2e')

The median amount of GHG emissions for buildings in NYC is 292.2 metric tons of CO2e
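
The calculation above relies on the list having an even length. A minimal sketch of a general version, using a hypothetical function *median_of* and made-up sample lists, handles both the odd and the even case:

```python
import statistics


def median_of(values):
    """Return the median of a non-empty list of numbers (hypothetical helper)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError('median_of() requires at least one value')
    mid = n // 2
    if n % 2 == 1:
        # Odd length: the single middle value is the median.
        return ordered[mid]
    # Even length: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


even_sample = [4.0, 1.0, 3.0, 2.0]
odd_sample = [5.0, 1.0, 3.0]
print(median_of(even_sample), median_of(odd_sample))
# Cross-check against the standard library.
print(statistics.median(even_sample), statistics.median(odd_sample))
```

Using sorted inside the function leaves the caller's list unchanged, unlike the in-place .sort method.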


### Finding the Mode Greenhouse Gas Emissions Using Basic Python
- To find the mode, an empty dictionary *counts* is set up and a for loop goes through the list *ghg_emissions*. Each value is added to the dictionary as a key with a count of 1 the first time it appears, and the count is increased by 1 each time the value appears again.

In [14]:
counts = {}
for row in ghg_emissions:
    if row in counts:
        counts[row]+=1
    else:
        counts[row]=1
counts

{-43.9: 1,
 -36.2: 1,
 0.0: 3133,
 0.1: 10,
 0.2: 8,
 0.3: 11,
 0.4: 13,
 0.5: 11,
 0.6: 18,
 0.7: 14,
 0.8: 8,
 0.9: 8,
 1.0: 4,
 1.1: 12,
 1.2: 5,
 1.3: 6,
 1.4: 9,
 1.5: 5,
 1.6: 6,
 1.7: 10,
 1.8: 4,
 1.9: 7,
 2.0: 7,
 2.1: 2,
 2.2: 4,
 2.3: 4,
 2.4: 9,
 2.5: 3,
 2.6: 6,
 2.7: 7,
 2.8: 10,
 2.9: 6,
 3.0: 7,
 3.1: 9,
 3.2: 9,
 3.3: 9,
 3.4: 7,
 3.5: 3,
 3.6: 5,
 3.7: 5,
 3.8: 3,
 3.9: 7,
 4.0: 7,
 4.1: 9,
 4.2: 8,
 4.3: 8,
 4.4: 6,
 4.5: 3,
 4.6: 7,
 4.7: 3,
 4.8: 3,
 4.9: 9,
 5.0: 7,
 5.1: 9,
 5.2: 9,
 5.3: 10,
 5.4: 8,
 5.5: 4,
 5.6: 8,
 5.7: 4,
 5.8: 7,
 5.9: 4,
 6.0: 5,
 6.1: 5,
 6.2: 7,
 6.3: 4,
 6.4: 8,
 6.5: 5,
 6.6: 7,
 6.7: 2,
 6.8: 4,
 6.9: 5,
 7.0: 9,
 7.1: 6,
 7.2: 4,
 7.3: 10,
 7.4: 8,
 7.5: 7,
 7.6: 5,
 7.7: 3,
 7.8: 2,
 7.9: 6,
 8.0: 4,
 8.1: 8,
 8.2: 6,
 8.3: 4,
 8.4: 6,
 8.5: 8,
 8.6: 4,
 8.7: 3,
 8.8: 5,
 8.9: 1,
 9.0: 5,
 9.1: 4,
 9.2: 6,
 9.3: 3,
 9.4: 7,
 9.5: 7,
 9.6: 8,
 9.7: 4,
 9.8: 6,
 9.9: 7,
 10.0: 2,
 10.1: 2,
 10.2: 9,
 10.3: 5,
 10.4: 1,
 10.5: 3,
 10.

In [15]:
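# Find the value with the highest count instead of hard-coding it
mode_value = max(counts, key=counts.get)
mode_count = counts[mode_value]
print(f'The mode amount of GHG emissions is {mode_value} with {mode_count} entries')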

The mode amount of GHG emissions is 0.0 with 3133 entries
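
The counting loop can also be written with collections.Counter from the standard library. This is a sketch on a hypothetical sample list, not the full dataset:

```python
from collections import Counter

# Hypothetical sample values in metric tons of CO2e.
sample_emissions = [0.0, 179.6, 0.0, 218.7, 179.6, 0.0]

# Counter builds the value -> count dictionary in one step.
sample_counts = Counter(sample_emissions)
# most_common(1) returns a list holding the single most frequent (value, count) pair.
mode_value, mode_count = sample_counts.most_common(1)[0]
print(mode_value, mode_count)
```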


## Using Pandas:
### Turn the Data into a Dataframe
- Using the pandas package, the data is read into a dataframe with the pd.read_csv function.

In [17]:
import pandas as pd

In [18]:
building_energy_df = pd.read_csv('nyc_building_energy.csv')


### Exploring and Updating the Data Type
- Filtering for the column of interest, the .info method is used to see what kind of data is in the relevant column.

In [20]:
building_energy_df['Total (Location-Based) GHG Emissions (Metric Tons CO2e)'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 64169 entries, 0 to 64168
Series name: Total (Location-Based) GHG Emissions (Metric Tons CO2e)
Non-Null Count  Dtype 
--------------  ----- 
64169 non-null  object
dtypes: object(1)
memory usage: 501.4+ KB


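- The column is converted from an object to a float using the pd.to_numeric function. With errors='coerce', any entry that cannot be parsed as a number (such as 'Not Available') becomes NaN.
- The .info method is then used to check that the change in data type worked as expected. The non-null count drops from 64169 to 62712, matching the length of the list in the prior section.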

In [22]:
building_energy_df['Total (Location-Based) GHG Emissions (Metric Tons CO2e)'] = pd.to_numeric(building_energy_df['Total (Location-Based) GHG Emissions (Metric Tons CO2e)'], errors ='coerce')

In [23]:
building_energy_df['Total (Location-Based) GHG Emissions (Metric Tons CO2e)'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 64169 entries, 0 to 64168
Series name: Total (Location-Based) GHG Emissions (Metric Tons CO2e)
Non-Null Count  Dtype  
--------------  -----  
62712 non-null  float64
dtypes: float64(1)
memory usage: 501.4 KB
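
The effect of errors='coerce' can be seen on a small hypothetical series (sample values only, not the real column):

```python
import pandas as pd

# Hypothetical text column mixing numbers and a 'Not Available' marker.
raw = pd.Series(['292.2', 'Not Available', '0.0'])

# Unparseable entries become NaN instead of raising an error.
numeric = pd.to_numeric(raw, errors='coerce')

print(numeric.dtype)
print(numeric.isna().sum())
```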


### Understanding the Data
- Using the .groupby method, we can start to get a better understanding of how often each value appears.

In [25]:
with pd.option_context("display.max_rows", 1000):
    display(building_energy_df.groupby("Total (Location-Based) GHG Emissions (Metric Tons CO2e)").size())

Total (Location-Based) GHG Emissions (Metric Tons CO2e)
-43.9            1
-36.2            1
 0.0          3133
 0.1            10
 0.2             8
              ... 
 846700.1        1
 846700.4        1
 2206862.8       1
 3134531.6       1
 3139711.6       1
Length: 15661, dtype: int64

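- To keep only the numeric data, a new slice *valid_totals* is created to exclude the rows with "Not Available". Because the earlier conversion with errors='coerce' already replaced those entries with NaN, this comparison keeps all 64169 rows. The 62712 non-null values are the ones used in the statistics that follow, since pandas skips NaN by default.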

In [27]:
valid_totals = building_energy_df[building_energy_df['Total (Location-Based) GHG Emissions (Metric Tons CO2e)'] != 'Not Available']
valid_totals['Total (Location-Based) GHG Emissions (Metric Tons CO2e)'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 64169 entries, 0 to 64168
Series name: Total (Location-Based) GHG Emissions (Metric Tons CO2e)
Non-Null Count  Dtype  
--------------  -----  
62712 non-null  float64
dtypes: float64(1)
memory usage: 501.4 KB
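
Once a column has been converted with errors='coerce', comparing it to the string 'Not Available' no longer removes anything, which is consistent with the 64169 entries reported above. A minimal sketch on hypothetical sample data shows the difference between that comparison and .dropna:

```python
import pandas as pd

GHG_COLUMN = 'Total (Location-Based) GHG Emissions (Metric Tons CO2e)'

# Hypothetical sample frame standing in for building_energy_df.
sample_df = pd.DataFrame({GHG_COLUMN: ['292.2', 'Not Available', '0.0']})
sample_df[GHG_COLUMN] = pd.to_numeric(sample_df[GHG_COLUMN], errors='coerce')

# After conversion the marker is NaN, so the string comparison keeps every row.
unchanged = sample_df[sample_df[GHG_COLUMN] != 'Not Available']
# dropna removes the rows whose emissions value is missing.
numeric_only = sample_df.dropna(subset=[GHG_COLUMN])

print(len(unchanged), len(numeric_only))
```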


- Before using the mean, median, and mode methods, the values are grouped, counted, and sorted by frequency to preview which value the mode should return.

In [29]:
counts = valid_totals.groupby("Total (Location-Based) GHG Emissions (Metric Tons CO2e)").size().sort_values(ascending=False).reset_index(name="count")

In [30]:
counts

| | Total (Location-Based) GHG Emissions (Metric Tons CO2e) | count |
|---|---|---|
| 0 | 0.0 | 3133 |
| 1 | 179.6 | 27 |
| 2 | 218.7 | 25 |
| 3 | 182.7 | 24 |
| 4 | 194.0 | 23 |
| ... | ... | ... |
| 15656 | 1344.9 | 1 |
| 15657 | 1344.5 | 1 |
| 15658 | 1344.2 | 1 |
| 15659 | 1343.5 | 1 |
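
A shorter route to the same kind of frequency table is Series.value_counts, which counts and sorts in descending order in one call. The sketch below uses a hypothetical sample series:

```python
import pandas as pd

# Hypothetical sample values in metric tons of CO2e.
sample = pd.Series([0.0, 179.6, 0.0, 218.7, 179.6, 0.0], name='emissions')

# value_counts sorts by frequency, most common first, and skips NaN by default.
sample_counts = sample.value_counts()

print(sample_counts.index[0], sample_counts.iloc[0])
```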


- Using the .mean method, we find the same mean as in the non-pandas process.
- The .median and .mode methods are then used, and they also return the same results.

In [32]:
pd_mean = valid_totals['Total (Location-Based) GHG Emissions (Metric Tons CO2e)'].mean()
print('The mean of the data is', pd_mean, 'metric tons of CO2e')

The mean of the data is 914.1075328485776 metric tons of CO2e


In [33]:
pd_median = valid_totals['Total (Location-Based) GHG Emissions (Metric Tons CO2e)'].median()
print('The median of the data is', pd_median,'metric tons of CO2e')

The median of the data is 292.2 metric tons of CO2e


In [34]:
# .mode() returns a Series because several values can tie; take the first entry
pd_mode = valid_totals['Total (Location-Based) GHG Emissions (Metric Tons CO2e)'].mode().iloc[0]
print('The mode of the data is', pd_mode, 'metric tons of CO2e')

The mode of the data is 0.0 metric tons of CO2e


The above results match the results from the grouped and sorted dataframe as well as the results from the prior section.
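
The pandas results match the list-based ones because .mean, .median, and .mode skip NaN by default. A small sketch with hypothetical values illustrates this:

```python
import pandas as pd

# Hypothetical sample with one missing value.
sample = pd.Series([0.0, 0.0, 300.0, float('nan')])

# NaN is ignored, so only the three numeric values are used.
sample_mean = sample.mean()
sample_median = sample.median()
sample_mode = sample.mode().iloc[0]

print(sample_mean, sample_median, sample_mode)
```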

## Creating a Simple Visualization
- First the data is simplified using the groupby method to create a smaller table that can then be visualized. Here, the data is grouped by year.

In [37]:
yearly_emissions = valid_totals.groupby('Calendar Year')['Total (Location-Based) GHG Emissions (Metric Tons CO2e)'].sum().reset_index()
yearly_emissions

| | Calendar Year | Total (Location-Based) GHG Emissions (Metric Tons CO2e) |
|---|---|---|
| 0 | 2022 | 24574861.9 |
| 1 | 2023 | 32750649.7 |
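
The grouping step can be tried on a small hypothetical frame (the column name Emissions and the values are made up for illustration):

```python
import pandas as pd

# Hypothetical sample: two buildings in 2022 and one in 2023.
sample_df = pd.DataFrame({
    'Calendar Year': [2022, 2022, 2023],
    'Emissions': [1.5, 2.5, 10.0],
})

# Sum emissions within each year, then turn the year index back into a column.
yearly = sample_df.groupby('Calendar Year')['Emissions'].sum().reset_index()
print(yearly)
```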


- Next, the .info method is used to better understand the type of data within each column.

In [39]:
yearly_emissions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column                                                   Non-Null Count  Dtype  
---  ------                                                   --------------  -----  
 0   Calendar Year                                            2 non-null      int64  
 1   Total (Location-Based) GHG Emissions (Metric Tons CO2e)  2 non-null      float64
dtypes: float64(1), int64(1)
memory usage: 164.0 bytes


- Lastly, a for loop goes through each row, converting the year into a string and dividing the GHG emissions by 1 million, truncated to a whole number. That number sets how many times the * character is repeated. The result is a simple visual that shows one * for every million metric tons of CO2e emitted in each year.

In [41]:
print('NYC Total Building Greenhouse Gas Emissions (Millions of Metric Tons of CO2e):')
for index, row in yearly_emissions.iterrows():
    year = str(int(row['Calendar Year']))
    co2 = row['Total (Location-Based) GHG Emissions (Metric Tons CO2e)']
    # Whole millions of metric tons, truncated toward zero
    millions_of_tons = int(co2 / 1000000)
    print(year)
    print(millions_of_tons * '*')

NYC Total Building Greenhouse Gas Emissions (Millions of Metric Tons of CO2e):
2022
************************
2023
********************************
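
The same text chart can be wrapped in a small reusable function. This is a sketch with a hypothetical helper *text_bar*; the two yearly totals are the ones shown in the table above.

```python
def text_bar(value, unit=1_000_000, symbol='*'):
    """Return one symbol per whole unit contained in value (hypothetical helper)."""
    return symbol * int(value // unit)


# Yearly totals in metric tons of CO2e, taken from the yearly_emissions output.
yearly_totals = {2022: 24574861.9, 2023: 32750649.7}

for year, total in yearly_totals.items():
    # Year and bar on one line keeps each label next to its bar.
    print(f'{year} {text_bar(total)}')
```

Changing *unit* rescales the chart, for example 5_000_000 would print one symbol per five million metric tons.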


In [42]:
# def year_list():
#     year = []
#     for index, row in yearly_emissions.iterrows():
#         # print (row)
#         str_year = [str(int(row['Calendar Year']))]
#         year.append(str_year)
        
#     return year

# def ppm_list():
#     ppm_list = []
#     for index, row in yearly_emissions.iterrows():
#         co2 = (row['Total (Location-Based) GHG Emissions (Metric Tons CO2e)'])
#         ppm = int(co2/1000000)
#         ppm_visual = [ppm*'*']
#         ppm_list.append(ppm_visual)
#     return ppm_list
# ppm_list()
# year_list()
# data = {'Year': year_list,
#         'Parts Per Million': ppm_list}
# data_visual = pd.DataFrame(data)
# data_visual

In [43]:
# city_emissions = valid_totals.groupby('City')['Total (Location-Based) GHG Emissions (Metric Tons CO2e)'].sum().reset_index()
# city_emissions

In [44]:
# property_type_emissions = valid_totals.groupby('City')['Total (Location-Based) GHG Emissions (Metric Tons CO2e)'].sum().reset_index()
# property_type_emissions