# Python Intro: Data Operations with Pandas

#### Handling relational data with SQL vs Python
SQL handles relational data well and is highly optimized for the operations it supports.
Python, however, is a general-purpose language: because it is imperative (as opposed to declarative), you control each step of the computation and can do much more than a query can express.
<ul>
<li>Data exploration and manipulation are a lot easier.</li>
<li>Many packages are available for complex mathematical and data analyses.</li>
<li>Plotting.</li>
</ul>
<br>
Use SQL for simple, efficient operations on large databases. Use Python for more complex analyses.

Required Datasets:
- CityTemps.csv
- TempsRegions.csv or OurTempsRegions.csv

This notebook uses Python's <b>pandas</b> library.

### The data is already organized into a table, so there is no need to manage lists or similar structures by hand

In [1]:
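# Correlation: x and y tend to occur together, but neither necessarily causes the other.
# Causation: x happens first and brings about y, so the two are directly related.
# A correlation can be driven by hidden (confounding) variables; any causal link may run through them.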


# pandas is data analysis library
import pandas as pd

pandas stores tabular (relational) data in <i>DataFrames</i>.

In [3]:
CT_dataframe = pd.read_csv("CityTemps.csv")
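
If CityTemps.csv is not at hand, the same call can be tried on an in-memory string. This is only a sketch: the five rows are copied from the output shown in this notebook, and the header row is an assumption based on the column names in that output.

```python
import io

import pandas as pd

# Five rows copied from the CityTemps output in this notebook (the real file has 56).
# The header line is assumed from the column names pandas displays.
csv_text = """city,state,lat,lng,temp
Mobile,Alabama,31.2,88.5,44
Montgomery,Alabama,32.9,86.8,38
Phoenix,Arizona,33.6,112.5,35
Little Rock,Arkansas,35.4,92.8,31
Los Angeles,California,34.3,118.7,47
"""

# read_csv accepts a file-like object as well as a path on disk
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # column types inferred by pandas
```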

In [4]:
print(CT_dataframe)

              city           state   lat    lng  temp
0           Mobile         Alabama  31.2   88.5    44
1       Montgomery         Alabama  32.9   86.8    38
2          Phoenix         Arizona  33.6  112.5    35
3      Little Rock        Arkansas  35.4   92.8    31
4      Los Angeles      California  34.3  118.7    47
5    San Francisco      California  38.4  123.0    42
6           Denver        Colorado  40.7  105.3    15
7        New Haven     Connecticut  41.7   73.4    22
8       Wilmington        Delaware  40.5   76.3    26
9       Washington              DC  39.7   77.5    30
10    Jacksonville         Florida  31.0   82.3    45
11        Key West         Florida  25.0   82.0    65
12           Miami         Florida  26.3   80.7    58
13         Atlanta         Georgia  33.9   85.0    37
14           Boise           Idaho  43.7  117.1    22
15         Chicago        Illinois  42.3   88.0    19
16    Indianapolis         Indiana  39.8   86.9    21
17      Des Moines          

### Quick Summary Statistics

In [3]:
CT_dataframe.describe()

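|       | lat       | lng       | temp      |
|-------|-----------|-----------|-----------|
| count | 56.0      | 56.0      | 56.0      |
| mean  | 38.969643 | 90.9625   | 26.517857 |
| std   | 5.378541  | 14.966697 | 13.379755 |
| min   | 25.0      | 70.5      | 0.0       |
| 25%   | 35.55     | 78.625    | 18.75     |
| 50%   | 39.8      | 87.8      | 24.5      |
| 75%   | 42.625    | 98.45     | 33.25     |
| max   | 48.1      | 123.2     | 65.0      |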
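
Each row of the summary can also be computed on its own. The sketch below uses five temperatures taken from the first rows of the dataset (not the full 56) to show how the `describe()` rows map to individual methods.

```python
import pandas as pd

# Temperatures of the first five cities in the dataset
temps = pd.Series([44, 38, 35, 31, 47], name="temp")

summary = temps.describe()

# The "mean" row matches .mean(), and the "50%" row is the median
print(summary["mean"], temps.mean())
print(summary["50%"], temps.median())
```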


### Number of Records in Dataframe

In [4]:
len(CT_dataframe)

56

### Obtaining first 5 rows

In [5]:
CT_dataframe.head(5)

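|   | city        | state      | lat  | lng   | temp |
|---|-------------|------------|------|-------|------|
| 0 | Mobile      | Alabama    | 31.2 | 88.5  | 44   |
| 1 | Montgomery  | Alabama    | 32.9 | 86.8  | 38   |
| 2 | Phoenix     | Arizona    | 33.6 | 112.5 | 35   |
| 3 | Little Rock | Arkansas   | 35.4 | 92.8  | 31   |
| 4 | Los Angeles | California | 34.3 | 118.7 | 47   |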


### Selecting the first 5 cities

In [6]:
cities = CT_dataframe['city'].head(5)

In [7]:
cities

0         Mobile
1     Montgomery
2        Phoenix
3    Little Rock
4    Los Angeles
Name: city, dtype: object
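
Selecting with a single column name returns a Series, as the `Name: city, dtype: object` line shows. The following sketch, on a five-row stand-in for the dataset, contrasts that with selecting several columns and with label-based row selection.

```python
import pandas as pd

# Small stand-in for CT_dataframe, using rows from the dataset
sample = pd.DataFrame({
    "city": ["Mobile", "Montgomery", "Phoenix", "Little Rock", "Los Angeles"],
    "state": ["Alabama", "Alabama", "Arizona", "Arkansas", "California"],
    "temp": [44, 38, 35, 31, 47],
})

one_column = sample["city"]             # a single name gives a Series
two_columns = sample[["city", "temp"]]  # a list of names gives a DataFrame

# .loc selects by label; unlike Python slices, the end label is included
first_three = sample.loc[0:2, "city"]
print(first_three.tolist())
```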

### Getting the maximum temperature

In [8]:
# finding max
CT_dataframe['temp'].max()

65
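
`max()` returns only the value. To see which city it belongs to, one option is `idxmax()`, which returns the row label of the maximum. A minimal sketch on the three Florida rows:

```python
import pandas as pd

# The three Florida cities from the dataset
sample = pd.DataFrame({
    "city": ["Jacksonville", "Key West", "Miami"],
    "state": ["Florida", "Florida", "Florida"],
    "temp": [45, 65, 58],
})

hottest_label = sample["temp"].idxmax()   # index label of the largest temp
hottest_row = sample.loc[hottest_label]   # full row at that label

print(hottest_row["city"], hottest_row["temp"])
```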

### Sorting

In [9]:
# sort by temperature (ascending), breaking ties by state (descending)
CT_dataframe.sort_values(by=['temp', 'state'], ascending=[True, False]).head(5)

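|    | city        | state        | lat  | lng   | temp |
|----|-------------|--------------|------|-------|------|
| 36 | Bismarck    | North Dakota | 47.1 | 101.0 | 0    |
| 25 | Minneapolis | Minnesota    | 45.9 | 93.9  | 2    |
| 49 | Burlington  | Vermont      | 45.0 | 73.9  | 7    |
| 27 | Helena      | Montana      | 47.1 | 112.4 | 8    |
| 53 | Madison     | Wisconsin    | 43.4 | 90.2  | 9    |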
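
The five coldest cities all have different temperatures, so the second sort key never comes into play in that output. The sketch below uses four rows from the dataset that do tie on `temp` to show what the descending `state` key does.

```python
import pandas as pd

# Four cities from the dataset with tied temperatures (44 and 45)
sample = pd.DataFrame({
    "city": ["Mobile", "Houston", "Jacksonville", "New Orleans"],
    "state": ["Alabama", "Texas", "Florida", "Louisiana"],
    "temp": [44, 44, 45, 45],
})

# temp ascending first; within equal temps, state in reverse alphabetical order
ordered = sample.sort_values(by=["temp", "state"], ascending=[True, False])

print(ordered["city"].tolist())
```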


### SELECT FROM WHERE
**Equivalent SQL query:**  
Select *  
From CityTemps  
Where Temp > 40

In [6]:
# select-where query as a boolean mask, using attribute access (.temp)
CT_dataframe[(CT_dataframe.temp > 40)]
# the next cell gives the same result using bracket access (['temp'])

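|    | city          | state      | lat  | lng   | temp |
|----|---------------|------------|------|-------|------|
| 0  | Mobile        | Alabama    | 31.2 | 88.5  | 44   |
| 4  | Los Angeles   | California | 34.3 | 118.7 | 47   |
| 5  | San Francisco | California | 38.4 | 123.0 | 42   |
| 10 | Jacksonville  | Florida    | 31.0 | 82.3  | 45   |
| 11 | Key West      | Florida    | 25.0 | 82.0  | 65   |
| 12 | Miami         | Florida    | 26.3 | 80.7  | 58   |
| 20 | New Orleans   | Louisiana  | 30.8 | 90.2  | 45   |
| 46 | Galveston     | Texas      | 29.4 | 95.5  | 49   |
| 47 | Houston       | Texas      | 30.1 | 95.9  | 44   |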


In [11]:
CT_dataframe[(CT_dataframe['temp'] > 40)]

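|    | city          | state      | lat  | lng   | temp |
|----|---------------|------------|------|-------|------|
| 0  | Mobile        | Alabama    | 31.2 | 88.5  | 44   |
| 4  | Los Angeles   | California | 34.3 | 118.7 | 47   |
| 5  | San Francisco | California | 38.4 | 123.0 | 42   |
| 10 | Jacksonville  | Florida    | 31.0 | 82.3  | 45   |
| 11 | Key West      | Florida    | 25.0 | 82.0  | 65   |
| 12 | Miami         | Florida    | 26.3 | 80.7  | 58   |
| 20 | New Orleans   | Louisiana  | 30.8 | 90.2  | 45   |
| 46 | Galveston     | Texas      | 29.4 | 95.5  | 49   |
| 47 | Houston       | Texas      | 30.1 | 95.9  | 44   |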
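
A WHERE clause with several conditions can be written by combining boolean masks with `&` (and) or `|` (or), with each condition in its own parentheses. The sketch below, on a five-row stand-in for the dataset, also shows `DataFrame.query` as an alternative spelling; the extra condition on `lat` is only an illustration.

```python
import pandas as pd

# Small stand-in for CT_dataframe, using rows from the dataset
sample = pd.DataFrame({
    "city": ["Mobile", "Phoenix", "Los Angeles", "Key West", "Denver"],
    "lat": [31.2, 33.6, 34.3, 25.0, 40.7],
    "temp": [44, 35, 47, 65, 15],
})

# Roughly: SELECT * FROM sample WHERE temp > 40 AND lat < 32
# Parentheses are required because & binds more tightly than > and <
warm_south = sample[(sample["temp"] > 40) & (sample["lat"] < 32)]

# The same filter written as a query string
same = sample.query("temp > 40 and lat < 32")

print(warm_south["city"].tolist())
```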


### Joining
**Equivalent SQL query:**  
Select *  
From CityTemps, TempsRegions  
Where CityTemps.state = TempsRegions.state

In [12]:
TR_dataframe = pd.read_csv("TempsRegions.csv")
join = CT_dataframe.merge(TR_dataframe, on='state')

In [13]:
join.head(9)  # show only the first 9 rows

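|   | city_x      | state      | lat_x | lng_x | temp_x | city_y        | lat_y | lng_y | temp_y | region       | coastal |
|---|-------------|------------|-------|-------|--------|---------------|-------|-------|--------|--------------|---------|
| 0 | Mobile      | Alabama    | 31.2  | 88.5  | 44     | Mobile        | 31.2  | 88.5  | 44     | Southcentral | Y       |
| 1 | Mobile      | Alabama    | 31.2  | 88.5  | 44     | Montgomery    | 32.9  | 86.8  | 38     | Southcentral | Y       |
| 2 | Montgomery  | Alabama    | 32.9  | 86.8  | 38     | Mobile        | 31.2  | 88.5  | 44     | Southcentral | Y       |
| 3 | Montgomery  | Alabama    | 32.9  | 86.8  | 38     | Montgomery    | 32.9  | 86.8  | 38     | Southcentral | Y       |
| 4 | Phoenix     | Arizona    | 33.6  | 112.5 | 35     | Phoenix       | 33.6  | 112.5 | 35     | Mountain     | N       |
| 5 | Little Rock | Arkansas   | 35.4  | 92.8  | 31     | Little Rock   | 35.4  | 92.8  | 31     | Southcentral | N       |
| 6 | Los Angeles | California | 34.3  | 118.7 | 47     | Los Angeles   | 34.3  | 118.7 | 47     | Pacific      | Y       |
| 7 | Los Angeles | California | 34.3  | 118.7 | 47     | Palo Alto     | 37.4  | 122.1 | 39     | Pacific      | Y       |
| 8 | Los Angeles | California | 34.3  | 118.7 | 47     | San Francisco | 38.4  | 123.0 | 42     | Pacific      | Y       |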
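
The output above has four Alabama rows for only two Alabama cities. Both tables list every city, so a merge on `state` pairs each city with every row of the same state, and the shared column names get `_x` and `_y` suffixes. If the goal is just to attach `region` and `coastal` to each city, one possible approach is to reduce the region table to one row per state first. The sketch below assumes that `region` and `coastal` are constant within a state, which holds for the rows shown but is not verified for the whole file.

```python
import pandas as pd

# Stand-ins for CT_dataframe and TR_dataframe, using rows from the outputs above
cities = pd.DataFrame({
    "city": ["Mobile", "Montgomery", "Phoenix"],
    "state": ["Alabama", "Alabama", "Arizona"],
    "temp": [44, 38, 35],
})
regions = pd.DataFrame({
    "city": ["Mobile", "Montgomery", "Phoenix"],
    "state": ["Alabama", "Alabama", "Arizona"],
    "temp": [44, 38, 35],
    "region": ["Southcentral", "Southcentral", "Mountain"],
    "coastal": ["Y", "Y", "N"],
})

# Many-to-many merge: 2 x 2 Alabama pairs + 1 Arizona pair = 5 rows
naive = cities.merge(regions, on="state")

# Assumption: region/coastal do not vary within a state, so keep one row per state
state_regions = regions[["state", "region", "coastal"]].drop_duplicates()

# validate raises an error if the right-hand keys turn out not to be unique
joined = cities.merge(state_regions, on="state", how="inner", validate="many_to_one")

print(len(naive), len(joined))
print(joined)
```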


### Group By
.sum() returns the sum of each group  
.mean() returns the average of each group  
.size() returns the number of rows in each group  

**Equivalent SQL query:**  
Select state, avg(lat), avg(lng), avg(temp)  
From CityTemps  
Group by state

In [20]:
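# numeric_only=True skips the text column 'city', which recent pandas versions refuse to average
CT_dataframe.groupby('state').mean(numeric_only=True)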

| state       | lat       | lng       | temp |
|-------------|-----------|-----------|------|
| Alabama     | 32.05     | 87.65     | 41.0 |
| Arizona     | 33.6      | 112.5     | 35.0 |
| Arkansas    | 35.4      | 92.8      | 31.0 |
| California  | 36.35     | 120.85    | 44.5 |
| Colorado    | 40.7      | 105.3     | 15.0 |
| Connecticut | 41.7      | 73.4      | 22.0 |
| DC          | 39.7      | 77.5      | 30.0 |
| Delaware    | 40.5      | 76.3      | 26.0 |
| Florida     | 27.433333 | 81.666667 | 56.0 |
| Georgia     | 33.9      | 85.0      | 37.0 |
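
The three aggregations listed at the start of this section can be computed in one pass with `agg`. A minimal sketch on the Alabama and Florida rows of the dataset:

```python
import pandas as pd

# Alabama and Florida cities from the dataset
sample = pd.DataFrame({
    "city": ["Mobile", "Montgomery", "Jacksonville", "Key West", "Miami"],
    "state": ["Alabama", "Alabama", "Florida", "Florida", "Florida"],
    "temp": [44, 38, 45, 65, 58],
})

# Roughly: SELECT state, count(*), sum(temp), avg(temp) FROM sample GROUP BY state
per_state = sample.groupby("state")["temp"].agg(["size", "sum", "mean"])

print(per_state)
```

The means agree with the Alabama (41.0) and Florida (56.0) rows in the output above.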
