# Exploring Data

<img src = '../imgs/white_dear.jpg' width=300>

One of the most important steps in doing science with data is to familiarize yourself with the data. This includes:

* Getting a first look at the data
* Understanding how it is structured
* Summarizing the data

**Load the Pandas library.**

In [3]:
import pandas as pd

**Read in the .csv file**

This data is courtesy of [Berkeley Earth](https://berkeleyearth.org/about/), which is affiliated with [Lawrence Berkeley National Laboratory](https://www.lbl.gov/).

Use the `read_csv` function in Pandas to read in the comma-separated values dataset (.csv file).

In [4]:
data = pd.read_csv("../data/GlobalLandTemperaturesByMajorCity.csv")
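
If you want to try `read_csv` without the full file, here is a minimal sketch. It assumes a tiny in-memory sample (the variable names `sample_csv` and `sample` are made up for this example) built from three of the Xian rows shown later in this notebook. `read_csv` accepts a file path or a file-like object such as `io.StringIO`.

```python
import io

import pandas as pd

# Hypothetical three-row sample that mimics part of the real file's layout.
sample_csv = io.StringIO(
    "dt,AverageTemperature,City\n"
    "2013-07-01,25.251,Xian\n"
    "2013-08-01,24.528,Xian\n"
    "2013-09-01,,Xian\n"
)

# read_csv accepts a path or a file-like object.
sample = pd.read_csv(sample_csv)

print(sample.shape)  # (number of rows, number of columns)
print(sample)        # the empty temperature field is read in as NaN
```

Notice that the empty field in the last row becomes `NaN`, which is how Pandas marks a missing value.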

### Getting a First Look

We put the data into a variable called `data`. In Pandas, it is stored as a "dataframe". To see it, we simply type the name of that variable as the last line of a cell.

In [5]:
data

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1849-01-01,26.704,1.435,Abidjan,Côte D'Ivoire,5.63N,3.23W
1,1849-02-01,27.434,1.362,Abidjan,Côte D'Ivoire,5.63N,3.23W
2,1849-03-01,28.101,1.612,Abidjan,Côte D'Ivoire,5.63N,3.23W
3,1849-04-01,26.140,1.387,Abidjan,Côte D'Ivoire,5.63N,3.23W
4,1849-05-01,25.427,1.200,Abidjan,Côte D'Ivoire,5.63N,3.23W
...,...,...,...,...,...,...,...
239172,2013-05-01,18.979,0.807,Xian,China,34.56N,108.97E
239173,2013-06-01,23.522,0.647,Xian,China,34.56N,108.97E
239174,2013-07-01,25.251,1.042,Xian,China,34.56N,108.97E
239175,2013-08-01,24.528,0.840,Xian,China,34.56N,108.97E


**Just view the first few entries.**

In [6]:
data.head()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1849-01-01,26.704,1.435,Abidjan,Côte D'Ivoire,5.63N,3.23W
1,1849-02-01,27.434,1.362,Abidjan,Côte D'Ivoire,5.63N,3.23W
2,1849-03-01,28.101,1.612,Abidjan,Côte D'Ivoire,5.63N,3.23W
3,1849-04-01,26.14,1.387,Abidjan,Côte D'Ivoire,5.63N,3.23W
4,1849-05-01,25.427,1.2,Abidjan,Côte D'Ivoire,5.63N,3.23W


**Just view the last few entries.**

In [7]:
data.tail()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
239172,2013-05-01,18.979,0.807,Xian,China,34.56N,108.97E
239173,2013-06-01,23.522,0.647,Xian,China,34.56N,108.97E
239174,2013-07-01,25.251,1.042,Xian,China,34.56N,108.97E
239175,2013-08-01,24.528,0.84,Xian,China,34.56N,108.97E
239176,2013-09-01,,,Xian,China,34.56N,108.97E


### How is the data structured?

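Looking at the dataframe only tells us so much about the data. It is important to get some details about its structure and data types: `.shape` returns the number of rows and columns, and `.info()` lists each column's data type and how many of its values are not missing.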

In [8]:
data.shape

(239177, 7)

In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 239177 entries, 0 to 239176
Data columns (total 7 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   dt                             239177 non-null  object 
 1   AverageTemperature             228175 non-null  float64
 2   AverageTemperatureUncertainty  228175 non-null  float64
 3   City                           239177 non-null  object 
 4   Country                        239177 non-null  object 
 5   Latitude                       239177 non-null  object 
 6   Longitude                      239177 non-null  object 
dtypes: float64(2), object(5)
memory usage: 12.8+ MB
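
The `.info()` output above reports 228175 non-null temperatures out of 239177 rows, so 11002 temperature values are missing. A quick way to count missing values directly is `isna().sum()`. The sketch below assumes a small hand-built dataframe (`sample` is a made-up name) that reuses four Chicago rows shown later in this notebook.

```python
import pandas as pd

# Small illustrative dataframe; None becomes NaN in a numeric column.
sample = pd.DataFrame({
    "dt": ["1743-11-01", "1743-12-01", "1744-01-01", "2013-08-01"],
    "AverageTemperature": [5.436, None, None, 22.230],
    "City": ["Chicago"] * 4,
})

# isna() marks each missing value as True; sum() counts the Trues per column.
missing_per_column = sample.isna().sum()

# notna() is the opposite and matches the "Non-Null Count" shown by info().
non_null_per_column = sample.notna().sum()

print(missing_per_column)
print(non_null_per_column)
```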


### Summary Statistics

Summary statistics provide an idea of the range and distribution of the numerical data. We can get a summary of the data using the `.describe()` method.

In [10]:
data.describe()

Unnamed: 0,AverageTemperature,AverageTemperatureUncertainty
count,228175.0,228175.0
mean,18.125969,0.969343
std,10.0248,0.979644
min,-26.772,0.04
25%,12.71,0.34
50%,20.428,0.592
75%,25.918,1.32
max,38.283,14.037
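
Each row of the `.describe()` output can also be computed on its own. Here is a small sketch, assuming a Series (`temps_c`, a name made up for this example) that holds the first five Abidjan temperatures from the table at the top of this notebook.

```python
import pandas as pd

# First five Abidjan temperatures (degrees Celsius) from the data above.
temps_c = pd.Series([26.704, 27.434, 28.101, 26.140, 25.427])

mean_temp = temps_c.mean()               # arithmetic average
median_temp = temps_c.median()           # middle value, same as the 50% row
lower_quartile = temps_c.quantile(0.25)  # same as the 25% row

# describe() bundles these statistics (and a few more) into one result.
summary = temps_c.describe()

print(mean_temp, median_temp, lower_quartile)
print(summary)
```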


**How many different cities do we have?**

In [14]:
data.drop_duplicates("City").count()

dt                               100
AverageTemperature               100
AverageTemperatureUncertainty    100
City                             100
Country                          100
Latitude                         100
Longitude                        100
dtype: int64
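
Another way to answer this question is to count the unique values in the `City` column directly. The sketch below is one possible approach, shown on a small made-up dataframe (`sample`) that borrows a few values from this notebook. It uses the Pandas methods `nunique()`, `unique()`, and `value_counts()`.

```python
import pandas as pd

# Small illustrative dataframe with three different cities.
sample = pd.DataFrame({
    "City": ["Abidjan", "Abidjan", "Chicago", "Xian", "Xian"],
    "AverageTemperature": [26.704, 27.434, 5.436, 24.528, None],
})

n_cities = sample["City"].nunique()            # number of distinct cities
city_names = sample["City"].unique()           # the distinct names themselves
rows_per_city = sample["City"].value_counts()  # how many rows each city has

print(n_cities)
print(city_names)
print(rows_per_city)
```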

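You're probably wondering whether `drop_duplicates()` deleted a bunch of data. It did not: it returns a new dataframe and leaves `data` unchanged, which we can confirm by viewing `data` again.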

In [52]:
data

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1849-01-01,26.704,1.435,Abidjan,Côte D'Ivoire,5.63N,3.23W
1,1849-02-01,27.434,1.362,Abidjan,Côte D'Ivoire,5.63N,3.23W
2,1849-03-01,28.101,1.612,Abidjan,Côte D'Ivoire,5.63N,3.23W
3,1849-04-01,26.140,1.387,Abidjan,Côte D'Ivoire,5.63N,3.23W
4,1849-05-01,25.427,1.200,Abidjan,Côte D'Ivoire,5.63N,3.23W
...,...,...,...,...,...,...,...
239172,2013-05-01,18.979,0.807,Xian,China,34.56N,108.97E
239173,2013-06-01,23.522,0.647,Xian,China,34.56N,108.97E
239174,2013-07-01,25.251,1.042,Xian,China,34.56N,108.97E
239175,2013-08-01,24.528,0.840,Xian,China,34.56N,108.97E
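
The same behavior can be checked on a smaller scale. This is a minimal sketch, assuming a four-row made-up dataframe (`sample`); it compares the number of rows before and after calling `drop_duplicates()`.

```python
import pandas as pd

# Four rows, two distinct cities.
sample = pd.DataFrame({
    "City": ["Abidjan", "Abidjan", "Xian", "Xian"],
    "AverageTemperature": [26.704, 27.434, 25.251, 24.528],
})

# drop_duplicates returns a new dataframe that keeps the first row per city.
deduplicated = sample.drop_duplicates("City")

print(len(sample))        # the original still has all of its rows
print(len(deduplicated))  # the new dataframe has one row per city
```

Because the result was assigned to a new name, both versions of the data are available afterwards.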


If we want to create a new set of data, we can assign our operation to a variable.

In [38]:
cities = data.drop_duplicates("City")
cities['City']

0                  Abidjan
1977           Addis Abeba
3942             Ahmadabad
6555                Aleppo
9224            Alexandria
11893               Ankara
14998              Baghdad
17335            Bangalore
19948              Bangkok
22319       Belo Horizonte
24500               Berlin
27739               Bogotá
30016               Bombay
32629             Brasília
34810                Cairo
37270             Calcutta
39883                 Cali
42148            Cape Town
44029           Casablanca
47038            Changchun
49356              Chengdu
51674              Chicago
54913            Chongqing
56998                Dakar
58975               Dalian
61188        Dar Es Salaam
63153                Delhi
65766                Dhaka
68379               Durban
70260           Faisalabad
72631            Fortaleza
74656                Gizeh
77116            Guangzhou
79201               Harare
81166               Harbin
83484     Ho Chi Minh City
85749            Hyderabad
8

**Show every row instead of a shortened list.**

In [37]:
pd.set_option('display.max_rows', None)
cities['City']

0                  Abidjan
1977           Addis Abeba
3942             Ahmadabad
6555                Aleppo
9224            Alexandria
11893               Ankara
14998              Baghdad
17335            Bangalore
19948              Bangkok
22319       Belo Horizonte
24500               Berlin
27739               Bogotá
30016               Bombay
32629             Brasília
34810                Cairo
37270             Calcutta
39883                 Cali
42148            Cape Town
44029           Casablanca
47038            Changchun
49356              Chengdu
51674              Chicago
54913            Chongqing
56998                Dakar
58975               Dalian
61188        Dar Es Salaam
63153                Delhi
65766                Dhaka
68379               Durban
70260           Faisalabad
72631            Fortaleza
74656                Gizeh
77116            Guangzhou
79201               Harare
81166               Harbin
83484     Ho Chi Minh City
85749            Hyderabad
8
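
Note that `pd.set_option('display.max_rows', None)` changes the display setting for the rest of the session. If you only want the full output once, one option is `pd.option_context`, which restores the previous setting when the `with` block ends. The sketch below assumes a made-up Series (`numbers`) that is long enough to be shortened by default.

```python
import pandas as pd

# A Series longer than the default display limit.
numbers = pd.Series(range(70))

default_max_rows = pd.get_option("display.max_rows")

# Inside the with block, no row limit applies.
with pd.option_context("display.max_rows", None):
    inside_setting = pd.get_option("display.max_rows")
    print(numbers)  # prints every row

# After the block, the original limit is back.
after_setting = pd.get_option("display.max_rows")
print(after_setting == default_max_rows)
```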

# Data Wrangling

This is one of the hardest parts of working with data. Raw data is rarely ready to be analyzed. Scientists spend a lot of time manipulating their data to get it into a form that can be used. We call it "wrangling".

<img src = 'https://upload.wikimedia.org/wikipedia/commons/thumb/5/52/Cats_in_aoshima_island_1.JPG/1600px-Cats_in_aoshima_island_1.JPG' width = 400>

Possible problems with data:
* Getting only the data you need 
* Incorrect data types
* Incorrect units
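
As an illustration of the "incorrect data types" problem, recall that `.info()` reported the `dt`, `Latitude`, and `Longitude` columns as `object` (text). Here is a rough sketch of one way to convert them, using a two-row sample built from values in this notebook. The helper `parse_coordinate` and the column names `lat_deg` and `lon_deg` are made up for this example, and treating south and west as negative is an assumed convention.

```python
import pandas as pd

sample = pd.DataFrame({
    "dt": ["1849-01-01", "2013-08-01"],
    "Latitude": ["5.63N", "34.56N"],
    "Longitude": ["3.23W", "108.97E"],
})

# Convert the date strings into real datetime values.
sample["dt"] = pd.to_datetime(sample["dt"])


def parse_coordinate(text):
    """Convert a string such as '3.23W' to a signed float (S and W negative)."""
    value = float(text[:-1])  # everything except the final letter
    return -value if text[-1] in "SW" else value


sample["lat_deg"] = sample["Latitude"].apply(parse_coordinate)
sample["lon_deg"] = sample["Longitude"].apply(parse_coordinate)

print(sample.dtypes)
print(sample)
```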

You can think of the dataframe from two perspectives: columns and rows. Columns are much easier to work with because each column is an array (a Pandas `Series`), meaning all the elements in a column share the same data type (or they should).

### Filtering Columns

We can select certain columns by putting their names in brackets after the dataframe name. To select more than one column, we pass a list of column names, and that list needs its own set of brackets. This is how we reference columns: `dataset_name[['column_name1', 'column_name2', ...]]`

The double brackets may seem odd, but they appear because we are providing the desired columns as a list, which is itself written in brackets:

`list_of_columns = ['column_name1', 'column_name2', ...]` <br>
+<br>
`dataset_name[list_of_columns]`<br>
=<br>
`dataset_name[['column_name1', 'column_name2', ...]]`
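
To see what the brackets do, here is a short sketch on a made-up two-row dataframe (`sample`). It compares single brackets, which return one column as a `Series`, with double brackets, which return a dataframe.

```python
import pandas as pd

sample = pd.DataFrame({
    "dt": ["1849-01-01", "1849-02-01"],
    "AverageTemperature": [26.704, 27.434],
    "City": ["Abidjan", "Abidjan"],
})

one_column = sample["City"]          # single brackets: a Series
one_column_frame = sample[["City"]]  # double brackets: a one-column dataframe

# Building the list first gives the same result as writing double brackets.
list_of_columns = ["dt", "City"]
subset = sample[list_of_columns]

print(type(one_column))
print(type(one_column_frame))
print(subset)
```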

In [19]:
temps = data[['dt', 'AverageTemperature', 'City', 'Country']]
temps

Unnamed: 0,dt,AverageTemperature,City,Country
0,1849-01-01,26.704,Abidjan,Côte D'Ivoire
1,1849-02-01,27.434,Abidjan,Côte D'Ivoire
2,1849-03-01,28.101,Abidjan,Côte D'Ivoire
3,1849-04-01,26.140,Abidjan,Côte D'Ivoire
4,1849-05-01,25.427,Abidjan,Côte D'Ivoire
...,...,...,...,...
239172,2013-05-01,18.979,Xian,China
239173,2013-06-01,23.522,Xian,China
239174,2013-07-01,25.251,Xian,China
239175,2013-08-01,24.528,Xian,China


### Renaming Columns

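Sometimes the original column names are confusing. You can change them by assigning a list of new column names to the dataframe's `columns` attribute. The list must contain one name for every column, in the same order as the existing columns.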

In [20]:
temps.columns = ['date','ave_temp', 'city', 'country']
temps

Unnamed: 0,date,ave_temp,city,country
0,1849-01-01,26.704,Abidjan,Côte D'Ivoire
1,1849-02-01,27.434,Abidjan,Côte D'Ivoire
2,1849-03-01,28.101,Abidjan,Côte D'Ivoire
3,1849-04-01,26.140,Abidjan,Côte D'Ivoire
4,1849-05-01,25.427,Abidjan,Côte D'Ivoire
...,...,...,...,...
239172,2013-05-01,18.979,Xian,China
239173,2013-06-01,23.522,Xian,China
239174,2013-07-01,25.251,Xian,China
239175,2013-08-01,24.528,Xian,China
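
If you only want to rename some of the columns, an alternative is the `rename` method, which takes a dictionary that maps old names to new names. This is a sketch on a made-up dataframe (`sample`), not a replacement for the approach above.

```python
import pandas as pd

sample = pd.DataFrame({
    "dt": ["1849-01-01", "1849-02-01"],
    "AverageTemperature": [26.704, 27.434],
    "City": ["Abidjan", "Abidjan"],
})

# Only the columns named in the dictionary are renamed.
renamed = sample.rename(columns={"dt": "date", "AverageTemperature": "ave_temp"})

print(list(renamed.columns))
print(list(sample.columns))  # rename returns a new dataframe by default
```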


### What about rows?

Rows are a little trickier to reference because they are not arrays. In columns, we can reference the name of the columns that we are interested in. Also, the elements in a column are the same data type. Since rows contain the elements of several different arrays, they can contain multiple data types and require a different way of thinking. We can filter rows two ways: by integer position (`iloc`) or by label (`loc`).

With the `iloc` indexer, you can pass a single position `[5]`, a slice of positions `[5:9]` (the end position is not included), or a list of positions `[[23, 45, 87]]`. Positions count from 0. Because this dataframe still has its default index, the positions match the bold index values in the first column.

`iloc`

In [23]:
# pass a value
temps.iloc[5]

date           1849-06-01
ave_temp           24.844
city              Abidjan
country     Côte D'Ivoire
Name: 5, dtype: object

In [46]:
# pass a range
temps.iloc[16:19]

Unnamed: 0,date,ave_temp,city,country
16,1850-05-01,25.379,Abidjan,Côte D'Ivoire
17,1850-06-01,24.903,Abidjan,Côte D'Ivoire
18,1850-07-01,24.04,Abidjan,Côte D'Ivoire


In [47]:
# pass a list
temps.iloc[[23, 45, 87]]

Unnamed: 0,date,ave_temp,city,country
23,1850-12-01,26.014,Abidjan,Côte D'Ivoire
45,1852-10-01,,Abidjan,Côte D'Ivoire
87,1856-04-01,26.998,Abidjan,Côte D'Ivoire


`loc`

The `loc` indexer selects rows by their index labels. You can pass a single label, a list of labels, or a condition that evaluates to `True` or `False` for every row.

In [24]:
# pass a value
temps.loc[5]

date           1849-06-01
ave_temp           24.844
city              Abidjan
country     Côte D'Ivoire
Name: 5, dtype: object
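
In the examples so far, `iloc[5]` and `loc[5]` return the same row because the index labels happen to equal the row positions. The two differ as soon as the labels no longer start at 0. Here is a small sketch, assuming a made-up dataframe (`chicago`) that reuses the index labels and values of the first three Chicago rows.

```python
import pandas as pd

# Three Chicago rows with their original index labels.
chicago = pd.DataFrame(
    {
        "date": ["1743-11-01", "1743-12-01", "1744-01-01"],
        "ave_temp": [5.436, None, None],
    },
    index=[51674, 51675, 51676],
)

first_by_position = chicago.iloc[0]  # the first row, whatever its label
first_by_label = chicago.loc[51674]  # the row whose label is 51674

# loc looks for the label 0, which does not exist here.
try:
    chicago.loc[0]
except KeyError:
    print("There is no row labeled 0 in this dataframe")

# Slices also differ: iloc excludes the end, loc includes it.
by_position = chicago.iloc[0:2]       # positions 0 and 1
by_label = chicago.loc[51674:51676]   # labels 51674 through 51676

print(len(by_position), len(by_label))
```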

In [26]:
# pass a conditional
temps.loc[temps['city'] == 'Chicago']

Unnamed: 0,date,ave_temp,city,country
51674,1743-11-01,5.436,Chicago,United States
51675,1743-12-01,,Chicago,United States
51676,1744-01-01,,Chicago,United States
51677,1744-02-01,,Chicago,United States
51678,1744-03-01,,Chicago,United States
...,...,...,...,...
54908,2013-05-01,13.734,Chicago,United States
54909,2013-06-01,17.913,Chicago,United States
54910,2013-07-01,21.914,Chicago,United States
54911,2013-08-01,22.230,Chicago,United States


In [27]:
# pass multiple conditionals
temps.loc[(temps['city'] == 'Chicago') & (temps['ave_temp'] >= 20)]

Unnamed: 0,date,ave_temp,city,country
51682,1744-07-01,21.680,Chicago,United States
51754,1750-07-01,23.713,Chicago,United States
51755,1750-08-01,23.249,Chicago,United States
51766,1751-07-01,22.473,Chicago,United States
51767,1751-08-01,22.787,Chicago,United States
...,...,...,...,...
54887,2011-08-01,23.414,Chicago,United States
54898,2012-07-01,25.909,Chicago,United States
54899,2012-08-01,22.778,Chicago,United States
54910,2013-07-01,21.914,Chicago,United States
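
Conditions can be combined in other ways too. In Pandas, `&` means "and", `|` means "or", and `~` means "not", and each condition needs its own parentheses. The `isin` method checks a column against a list of values. The sketch below uses a small made-up dataframe (`sample`) with a few temperatures taken from this notebook.

```python
import pandas as pd

sample = pd.DataFrame({
    "city": ["Chicago", "Chicago", "Xian", "Abidjan"],
    "ave_temp": [21.914, 5.436, 24.528, 26.704],
})

# and: both conditions must be true
warm_chicago = sample.loc[(sample["city"] == "Chicago") & (sample["ave_temp"] >= 20)]

# or: at least one condition must be true
chicago_or_hot = sample.loc[(sample["city"] == "Chicago") | (sample["ave_temp"] >= 25)]

# not: flips True and False
not_chicago = sample.loc[~(sample["city"] == "Chicago")]

# isin: keep rows whose city appears in the list
two_cities = sample.loc[sample["city"].isin(["Chicago", "Xian"])]

print(warm_chicago)
print(chicago_or_hot)
print(not_chicago)
print(two_cities)
```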


### Working with Values

The temperatures for all the cities are in Celsius. Let's convert them to Fahrenheit so they are easier to understand.

$$ T(F) = \frac{9}{5}T(C)+32 $$

**Define a function for converting Celsius to Fahrenheit.**

In [38]:
def Celsius_to_Fahrenheit(temp_C):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    temp_F = (temp_C * 9/5) + 32
    return temp_F

**"Apply" the function to the data.**

In [51]:
temps = temps.copy()
temps['ave_temp_F'] = temps['ave_temp'].apply(Celsius_to_Fahrenheit)
temps.head()

Unnamed: 0,date,ave_temp,city,country,ave_temp_F
0,1849-01-01,26.704,Abidjan,Côte D'Ivoire,80.0672
1,1849-02-01,27.434,Abidjan,Côte D'Ivoire,81.3812
2,1849-03-01,28.101,Abidjan,Côte D'Ivoire,82.5818
3,1849-04-01,26.14,Abidjan,Côte D'Ivoire,79.052
4,1849-05-01,25.427,Abidjan,Côte D'Ivoire,77.7686
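
Because the conversion is simple arithmetic, it can also be written directly on the whole column, without `apply`. Pandas then performs the calculation on every element at once. This is a sketch on a made-up Series (`temps_c`) holding the five Abidjan temperatures shown above; the lowercase function name is an assumption of this example.

```python
import pandas as pd

temps_c = pd.Series([26.704, 27.434, 28.101, 26.140, 25.427], name="ave_temp")


def celsius_to_fahrenheit(temp_c):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return temp_c * 9 / 5 + 32


# Element by element, as in the cell above.
via_apply = temps_c.apply(celsius_to_fahrenheit)

# The same arithmetic written on the whole column at once.
vectorized = temps_c * 9 / 5 + 32

print(via_apply)
print(vectorized)
```

Both approaches give the same numbers. Writing the arithmetic on the whole column is usually faster on large datasets, while `apply` is handy when the function is too complicated to express as column arithmetic.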
