# Project 5: Communicate Data Findings

São Paulo, 15 June 2019<br>
Felipe Mahlmeister

## Table of Contents

1. [Summary](#summary)<br>
2. [Extracting the Data](#extract)<br>
3. [Analysis, Modeling, and Validation](#analysis)<br>
4. [Conclusion](#conclusion)<br>

<a id='summary'></a>
## 1. Summary

In this project, I will analyze local and global temperature data and compare the temperature trends where I live to overall global temperature trends.

<a id='extract'></a>
## 2. Preliminary Wrangling

> Briefly introduce your dataset here.

In [1]:
# import all packages and set plots to be embedded inline
import os
import numpy as np
import time
import requests
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from urllib.request import urlretrieve
import bz2

%matplotlib inline

> Load in your dataset and describe its properties through the questions below.
Try and motivate your exploration goals through this section.

1. Choose the range of years to download.
2. Build the URL and file path lists with `get_url`.
3. Download and decompress each file with `get_flights_data`.

In [2]:
# Choose the range of years you want to download
start_year = 2007
end_year = 2008

In [3]:
def get_url(start_year, end_year):
    
    # Create lists
    url, filepath = [], []
    
    for year in range(start_year,end_year+1):
    
        # Create full url string
        url_str = 'http://stat-computing.org/dataexpo/2009/'+str(year)+'.csv.bz2'

        # Append url string to the list
        url.append(url_str)

        # Create full filepath string
        filepath_str = 'source/'+str(year)+'.csv.bz2'

        # Append filepath string to the list
        filepath.append(filepath_str)

    return url, filepath

In [4]:
url, filepath = get_url(start_year, end_year)
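
For the 2007 to 2008 range chosen above, `get_url` should return two parallel lists, one of URLs and one of local paths. The sketch below builds the same strings with list comprehensions so the expected values are explicit. The name `build_paths` is hypothetical and is not part of the notebook.

```python
def build_paths(start_year, end_year):
    """Hypothetical stand-in for get_url: same strings, built with comprehensions."""
    years = range(start_year, end_year + 1)  # end_year is inclusive
    urls = ['http://stat-computing.org/dataexpo/2009/' + str(y) + '.csv.bz2' for y in years]
    paths = ['source/' + str(y) + '.csv.bz2' for y in years]
    return urls, paths

urls, paths = build_paths(2007, 2008)
print(urls)
print(paths)
```

Both lists share the same order, so `urls[i]` and `paths[i]` always refer to the same year.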

In [5]:
def get_download_and_unzip(filepath, url, force_download=False):
    
    # Each list collects time.time() stamps for one step:
    # d_start_l / d_end_l = download start / end
    # u_start_l / u_end_l = unzip start / end
    
    d_start_l, d_end_l = [], []
    u_start_l, u_end_l = [], []
    
    if force_download or not os.path.exists(filepath[:-4]):
        
        # -------- Calculate download time: start
        start_1 = time.time()
        
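        urlData = requests.get(url)

        # Stop here on an HTTP error instead of saving an error page as the archive
        urlData.raise_for_status()

        with open(filepath, mode ='wb') as file:
            file.write(urlData.content)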
            
        # -------- Calculate download time: end
        end_1 = time.time()
        
        # -------- Calculate unzip time: start
        start_2 = time.time()
        
        # Open the file and get the decompressed data
        with bz2.BZ2File(filepath) as zipfile:
            data = zipfile.read()
        
        # Assuming the filepath ends with .bz2
        newfilepath = filepath[:-4]
        
        # Write an uncompressed file, closing it before the archive is deleted
        with open(newfilepath, 'wb') as newfile:
            newfile.write(data)
        
        # -------- Calculate unzip time: end
        end_2 = time.time()
           
        # Delete zip files from source folder
        os.remove(filepath)
        
        # Add execution time to list
        
        d_start_l.append(start_1)
        d_end_l.append(end_1)
        u_start_l.append(start_2)
        u_end_l.append(end_2)
    else:
        print(filepath[:-4], "has already been downloaded")
    return d_start_l, d_end_l, u_start_l, u_end_l
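
`zipfile.read()` loads the whole decompressed file into memory before it is written out. For wider year ranges, one alternative is to stream the archive to disk in chunks. The sketch below is not the notebook's method: `decompress_bz2` is a hypothetical helper, and it assumes the path ends in `.bz2`. It relies only on `bz2.open` and `shutil.copyfileobj` from the standard library, and demonstrates the idea on a small temporary file.

```python
import bz2
import os
import shutil
import tempfile

def decompress_bz2(bz2_path):
    """Stream a .bz2 file to disk and return the new path (hypothetical helper)."""
    out_path = bz2_path[:-4]  # assumes the path ends with '.bz2'
    with bz2.open(bz2_path, 'rb') as src, open(out_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)  # copies chunk by chunk
    return out_path

# Small self-contained demonstration on a temporary file
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'demo.csv.bz2')
with bz2.open(demo_path, 'wb') as f:
    f.write(b'Year,Month\n2008,1\n')

csv_path = decompress_bz2(demo_path)
with open(csv_path, 'rb') as f:
    demo_content = f.read()
print(demo_content)
```

Both files are opened in a single `with` statement, so they are closed even if the copy fails partway through.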

In [6]:
def get_flights_data(url, filepath):
    
    total_d_start_l, total_d_end_l = [], []
    total_u_start_l, total_u_end_l = [], []
    total_d_diff_l, total_u_diff_l = [], []

    # Create the download folder; no error if it already exists
    os.makedirs('source', exist_ok=True)

    for file in range(0,len(url)):

        d_start_l, d_end_l, u_start_l, u_end_l = get_download_and_unzip(filepath[file], url[file])

        # Empty lists mean the file was already on disk, so there is nothing to time
        if not d_start_l:
            continue

        total_d_start_l.append(d_start_l)
        total_d_end_l.append(d_end_l)
        total_u_start_l.append(u_start_l)
        total_u_end_l.append(u_end_l)

        total_d_diff = d_end_l[0]-d_start_l[0]
        total_d_diff_l.append(total_d_diff)

        total_u_diff = u_end_l[0]-u_start_l[0]
        total_u_diff_l.append(total_u_diff)

        total_diff = total_d_diff + total_u_diff

        print('')
        print(filepath[file][7:-4],':','file',file+1,'of',len(url),'were successfully downloaded and unzipped in',
              '{:0.2f}'.format(total_diff),'seconds')

    total_download_time = sum(total_d_diff_l)
    total_unzip_time = sum(total_u_diff_l)
    total_download_unzip = total_download_time + total_unzip_time

    print('-----------------------------------')
    print('total download time:','{:0.2f}'.format(total_download_time),'seconds')
    print('total unzip time:','{:0.2f}'.format(total_unzip_time),'seconds')
    print('')
    print('total execution time: ','{:0.2f}'.format((total_download_unzip)/60), 'minutes')

    statinfo = []

    for file in range(0,len(filepath)):

        file_bytes = os.stat(filepath[file][:-4]).st_size

        file_gb = file_bytes/1024**3

        statinfo.append(file_gb)

    print('total size of downloaded files:','{:0.2f}'.format(sum(statinfo)),'GB')

In [7]:
get_flights_data(url, filepath)


2007.csv : file 1 of 2 were successfully downloaded and unzipped in 58.67 seconds

2008.csv : file 2 of 2 were successfully downloaded and unzipped in 43.73 seconds
-----------------------------------
total download time: 39.84 seconds
total unzip time: 62.56 seconds

total execution time:  1.71 minutes
total size of downloaded files: 1.30 GB
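
The paired `start_*` and `end_*` variables used for timing can also be wrapped in a small context manager. This is a sketch only: `timed` and the `timings` dictionary are hypothetical names, the `time.sleep` calls are placeholders for the real steps, and `time.perf_counter` is swapped in for `time.time` because it is a monotonic clock meant for measuring intervals.

```python
import time
from contextlib import contextmanager

timings = {}  # label -> elapsed seconds (hypothetical bookkeeping dict)

@contextmanager
def timed(label):
    """Record how long the enclosed block takes under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start

with timed('download'):
    time.sleep(0.01)  # placeholder for the download step
with timed('unzip'):
    time.sleep(0.01)  # placeholder for the unzip step

total = sum(timings.values())
print('total execution time: {:0.2f} seconds'.format(total))
```

Keeping one entry per label removes the need for parallel start and end lists.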


In [None]:
def get_flights_dataframe(filepath):
    
    # Calculate the time of execution
    start_3 = time.time()

    # Read each decompressed CSV (the path without '.bz2') and stack them into one dataframe
    df = pd.concat((pd.read_csv(path[:-4]) for path in filepath), ignore_index=True)

    # Calculate the time of execution
    end_3 = time.time()

    print('read df - execution time: ','{:0.2f}'.format(end_3 - start_3), 'seconds')
    print('read df - execution time: ','{:0.2f}'.format((end_3 - start_3)/60), 'minutes')

    return df

df = get_flights_dataframe(filepath)

In [None]:
df.head()

In [None]:
df.shape

In [None]:
def test(url):
    print(url)

In [None]:
URL_1 = 'http://stat-computing.org/dataexpo/2009/2008.csv.bz2'
URL_2 = 'http://stat-computing.org/dataexpo/2009/2007.csv.bz2'

# test() only prints and returns None, so there is nothing to assign
test(URL_1)
test(URL_2)

In [None]:
# Calculate the time of execution
start_4 = time.time()

df_1 = pd.read_csv('source/2008.csv.bz2', compression='bz2', header=0, sep=',')
        
# Calculate the time of execution
end_4 = time.time()

In [None]:
print('-----------------------------------------------------')
print('read df - execution time: ',end_4 - start_4, 'seconds')
print('read df - execution time: ',(end_4 - start_4)/60, 'minutes')

In [None]:
df_1.head()

In [None]:
df_1.info()

In [None]:
# create directories for downloads
os.makedirs('data', exist_ok=True)

In [None]:
# Calculate the time of execution
start_1 = time.time()

url = 'http://stat-computing.org/dataexpo/2009/2008.csv.bz2'
urlData = requests.get(url)

with open('data/2008.csv.bz2', mode ='wb') as file:
    file.write(urlData.content)
    
# Calculate the time of execution
end_1 = time.time()
print('-----------------------------------------------------')
print('execution time: ',end_1 - start_1, 'seconds')
print('execution time: ',(end_1 - start_1)/60, 'minutes')

In [None]:
# Calculate the time of execution
start_2 = time.time()

filepath = 'data/2008.csv.bz2'
newfilepath = filepath[:-4] # assuming the filepath ends with .bz2

# open the file and get the decompressed data
with bz2.BZ2File(filepath) as zipfile:
    data = zipfile.read()

# write an uncompressed file
with open(newfilepath, 'wb') as newfile:
    newfile.write(data)

# Calculate the time of execution
end_2 = time.time()
print('-----------------------------------------------------')
print('execution time: ',end_2 - start_2, 'seconds')
print('execution time: ',(end_2 - start_2)/60, 'minutes')

In [None]:
# Load the 2008 flights file

# Calculate the time of execution
start_3 = time.time()

df = pd.read_csv('data/2008.csv')

# Calculate the time of execution
end_3 = time.time()
print('-----------------------------------------------------')
print('execution time: ',end_3 - start_3, 'seconds')
print('execution time: ',(end_3 - start_3)/60, 'minutes')

In [None]:
print('-----------------------------------------------------')
print('execution time: ',(end_1+end_2+end_3)-(start_1+start_2+start_3), 'seconds')
print('execution time: ',((end_1+end_2+end_3)-(start_1+start_2+start_3))/60, 'minutes')

In [None]:
df.head()

In [None]:
df.shape

### What is the structure of your dataset?

> Your answer here!

### What is/are the main feature(s) of interest in your dataset?

> Your answer here!

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

> Your answer here!

## Univariate Exploration

> In this section, investigate distributions of individual variables. If
you see unusual points or outliers, take a deeper look to clean things up
and prepare yourself to look at relationships between variables.

> After every plot or related series of plots, make sure you
include a Markdown cell with comments about what you observed and what
you plan on investigating next.

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> Your answer here!

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> Your answer here!

## Bivariate Exploration

> In this section, investigate relationships between pairs of variables in your
data. Make sure the variables that you cover here have been introduced in some
fashion in the previous section (univariate exploration).

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> Your answer here!

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> Your answer here!

## Multivariate Exploration

> Create plots of three or more variables to investigate your data even
further. Make sure that your investigations are justified, and follow from
your work in the previous sections.

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> Your answer here!

### Were there any interesting or surprising interactions between features?

> Your answer here!

> At the end of your report, make sure that you export the notebook as an
html file from the `File > Download as... > HTML` menu. Make sure you keep
track of where the exported file goes, so you can put it in the same folder
as this notebook for project submission. Also, make sure you remove all of
the quote-formatted guide notes like this one before you finish your report!