### Import a Dataset Into Jupyter

In [34]:
import pandas as pd
print(pd.__version__)

0.23.0


The data is returned as a DataFrame, a two-dimensional, spreadsheet-like data structure whose columns can have different types. pandas has two main data structures: the DataFrame and the Series. A Series is a one-dimensional array that can hold any value type. A Series can exist on its own, but a single DataFrame column may also be treated as a Series.
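
To make the DataFrame/Series relationship concrete, here is a minimal sketch. It assumes a tiny hand-built sample (three rows copied from the table in this section) rather than the full CSV, and the variable names `sample` and `column` are only illustrative.

```python
import pandas as pd

# Hand-built sample (an assumption for illustration, not the full dataset)
sample = pd.DataFrame({
    "Geography": ["Atherton", "Colma", "Foster City"],
    "Geographytype": ["Town", "Town", "City"],
    "LHSG": [13.6, 6.3, 11.9],
})

# Selecting a single column returns a Series
column = sample["LHSG"]

print(type(sample).__name__)   # DataFrame
print(type(column).__name__)   # Series
print(sample.shape)            # (3, 3): rows, columns
print(column.shape)            # (3,): one dimension only
```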

Evaluating df on its own displays the DataFrame we imported (only its first rows are shown here), and .head(5) returns just the first 5 rows. To see the last n rows, use .tail(n).

In [35]:
df = pd.read_csv("Educational_Attainment.csv")
df

Unnamed: 0,Geography,Geographytype,Year,LHSG,HSG,SCAD,BDH,Location 1,Census Designated Places,Zip Codes
0,Atherton,Town,1/1/2014 0:00,13.6,12.3,2.7,3.5,"(37.458611, -122.2)",2.0,28596
1,Colma,Town,1/1/2014 0:00,6.3,6.4,10.4,2.4,"(37.678889, -122.455556)",4.0,28588
2,Foster City,City,1/1/2014 0:00,11.9,9.7,2.0,2.9,"(37.551389, -122.266389)",6.0,319
3,Portola Valley,Town,1/1/2014 0:00,48.1,0.0,0.0,1.8,"(37.375, -122.218611)",14.0,28597
4,Redwood City,City,1/1/2014 0:00,16.4,10.6,6.6,3.0,"(37.482778, -122.236111)",21.0,28607
5,Ladera,CDP,1/1/2014 0:00,0.0,0.0,0.0,0.0,"(37.319167, -122.274167)",27.0,28591
6,Moss Beach,CDP,1/1/2014 0:00,4.8,8.2,9.3,3.5,"(37.525278, -122.512778)",30.0,28600
7,South San Francisco,City,1/1/2014 0:00,14.2,7.8,4.4,4.1,"(37.656111, -122.425556)",19.0,28613
8,Belmont,City,1/1/2014 0:00,20.9,5.9,5.0,3.6,"(37.518056, -122.291667)",15.0,28585
9,Daly City,City,1/1/2014 0:00,13.4,8.9,6.1,6.1,"(37.686389, -122.468333)",,28588


In [36]:
df.head(5)

Unnamed: 0,Geography,Geographytype,Year,LHSG,HSG,SCAD,BDH,Location 1,Census Designated Places,Zip Codes
0,Atherton,Town,1/1/2014 0:00,13.6,12.3,2.7,3.5,"(37.458611, -122.2)",2.0,28596
1,Colma,Town,1/1/2014 0:00,6.3,6.4,10.4,2.4,"(37.678889, -122.455556)",4.0,28588
2,Foster City,City,1/1/2014 0:00,11.9,9.7,2.0,2.9,"(37.551389, -122.266389)",6.0,319
3,Portola Valley,Town,1/1/2014 0:00,48.1,0.0,0.0,1.8,"(37.375, -122.218611)",14.0,28597
4,Redwood City,City,1/1/2014 0:00,16.4,10.6,6.6,3.0,"(37.482778, -122.236111)",21.0,28607
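
The .tail(n) method mentioned above works the same way as .head(n) but counts from the end. A short sketch, assuming a five-row sample built from the rows displayed above (the names `sample`, `first_two`, and `last_two` are illustrative):

```python
import pandas as pd

# Five rows copied from the output above; not the full dataset
sample = pd.DataFrame({
    "Geography": ["Atherton", "Colma", "Foster City",
                  "Portola Valley", "Redwood City"],
    "BDH": [3.5, 2.4, 2.9, 1.8, 3.0],
})

first_two = sample.head(2)   # rows with index 0 and 1
last_two = sample.tail(2)    # rows with index 3 and 4

print(first_two)
print(last_two)
```

Both methods keep the original index labels, so the rows returned by .tail(2) are still labeled 3 and 4.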


### Basic Analysis of Dataset

pandas has several methods that allow you to quickly analyze a dataset and get an idea of the type and amount of data you are dealing with along with some important statistics. 

- .shape - returns the row and column count of a dataset
- .describe() - returns statistics about the numerical columns in a dataset 
- .dtypes - returns the data type of each column


In [37]:
df.shape

(32, 10)

In [38]:
df.describe()

Unnamed: 0,LHSG,HSG,SCAD,BDH,Census Designated Places,Zip Codes
count,32.0,32.0,32.0,32.0,30.0,32.0
mean,17.8,6.4625,5.946875,2.85625,17.733333,25062.09375
std,19.29944,4.693905,4.72843,1.873919,9.762466,9502.711577
min,0.0,0.0,0.0,0.0,1.0,312.0
25%,6.825,1.925,2.525,2.1,9.5,28587.75
50%,13.9,7.75,5.5,3.0,18.5,28595.0
75%,20.975,9.45,8.8,3.6,25.75,28604.25
max,100.0,16.4,18.5,9.1,34.0,28613.0


You can also call the .describe method with the include='all' argument to get statistics on the non-numeric columns as well. In this example we have to drop the "Location 1" column first if it holds dictionary objects, because the .describe method doesn't accept them.
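
A minimal sketch of that call follows. It assumes the location column is named "Location 1", as in the table above, and uses a small hand-built sample in place of the CSV; `sample` and `summary` are illustrative names.

```python
import pandas as pd

# Hand-built sample (an assumption for illustration, not the full dataset)
sample = pd.DataFrame({
    "Geography": ["Atherton", "Colma", "Foster City"],
    "Geographytype": ["Town", "Town", "City"],
    "LHSG": [13.6, 6.3, 11.9],
    "Location 1": ["(37.458611, -122.2)",
                   "(37.678889, -122.455556)",
                   "(37.551389, -122.266389)"],
})

# Drop the location column, then summarize numeric and non-numeric columns
summary = sample.drop(columns=["Location 1"]).describe(include="all")
print(summary)
```

For non-numeric columns the summary fills in rows such as unique, top, and freq, while the numeric-only rows (mean, std, and the percentiles) are left as NaN for those columns.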

In [39]:
df.dtypes

Geography                    object
Geographytype                object
Year                         object
LHSG                        float64
HSG                         float64
SCAD                        float64
BDH                         float64
Location 1                   object
Census Designated Places    float64
Zip Codes                     int64
dtype: object

Here are some additional methods that give you statistics for a DataFrame or for a particular column in a DataFrame.
- .mean(axis=0) - returns the statistical mean (axis=0 gives the calculated value per column)
- .median(axis=0) - returns the statistical median (axis=0 gives the calculated value per column)
- .mode(axis=0) - returns the statistical mode (axis=0 gives the calculated value per column)
- .count() - gives number of total values in column
- .unique() - returns array of all unique values in that column
- .value_counts() - returns object containing counts of unique values

In [40]:
df.BDH.mean()

2.8562500000000006

In [41]:
df.Geography.count()


32

In [42]:
df.SCAD.unique()

array([ 2.7, 10.4,  2. ,  0. ,  6.6,  9.3,  4.4,  5. ,  6.1,  6.5, 18.5,
        4.1,  1.5, 11.5,  8.7,  4.3, 10.2, 17.2,  3. ,  5.3,  6.8,  0.7,
        7. ,  9.1,  5.7,  7.4, 11.9])

In [43]:
df.LHSG.value_counts()

0.0      4
14.2     1
8.5      1
7.0      1
100.0    1
9.5      1
11.9     1
4.8      1
31.1     1
26.7     1
6.2      1
15.7     1
22.1     1
16.4     1
6.3      1
44.4     1
20.9     1
7.7      1
9.2      1
37.8     1
3.3      1
15.1     1
48.1     1
18.3     1
21.2     1
16.1     1
13.6     1
13.4     1
20.1     1
Name: LHSG, dtype: int64
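
The examples above call these methods on a single column. As a sketch of the axis argument on a whole DataFrame, the following assumes a small hand-built sample and selects the numeric columns first, since recent pandas versions raise an error when averaging text columns. All variable names are illustrative.

```python
import pandas as pd

# Hand-built sample (an assumption for illustration, not the full dataset)
sample = pd.DataFrame({
    "Geographytype": ["Town", "Town", "City"],
    "LHSG": [13.6, 6.3, 11.9],
    "HSG": [12.3, 6.4, 9.7],
})

numeric = sample[["LHSG", "HSG"]]

col_means = numeric.mean(axis=0)      # one value per column
col_medians = numeric.median(axis=0)  # one value per column
row_means = numeric.mean(axis=1)      # one value per row

# .mode() returns a Series because several values can tie for most frequent
type_mode = sample["Geographytype"].mode()

print(col_means)
print(col_medians)
print(row_means)
print(type_mode)
```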

### Mapping Functions to Transform Data

Often we need to apply a function to a column in a dataset to transform it. pandas makes this easy with the .apply() method. In this example, we will map the values in the "Geographytype" column to either 1 or 0 depending on the value, and append this information to the DataFrame in a new column.

In [44]:
def mapGeography(x):
    if x == "City":
        return 1
    else:
        return 0

In [46]:
df['geography_mapped_value'] = df.Geographytype.apply(mapGeography)

In [47]:
df.geography_mapped_value.value_counts()

0    17
1    15
Name: geography_mapped_value, dtype: int64

We could also have accomplished the same thing with a lambda function:

In [49]:
df['geography_mapped_value_lambda'] = df.Geographytype.apply(lambda y: 1 if y == "City" else 0)

In [50]:
df.geography_mapped_value_lambda.value_counts()

0    17
1    15
Name: geography_mapped_value_lambda, dtype: int64
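
For a simple equality test like this one, a column-wide comparison is a possible alternative to .apply(). This is a sketch of a different technique, not the approach used above; it assumes a small hand-built sample, and the column name `is_city` is hypothetical.

```python
import pandas as pd

# Hand-built sample (an assumption for illustration, not the full dataset)
sample = pd.DataFrame({"Geographytype": ["Town", "City", "CDP", "City"]})

# The comparison yields a boolean Series; astype(int) turns True/False into 1/0
sample["is_city"] = (sample["Geographytype"] == "City").astype(int)

# The .apply() version from above, for comparison
via_apply = sample["Geographytype"].apply(lambda y: 1 if y == "City" else 0)

print(sample)
print((sample["is_city"] == via_apply).all())  # True
```

The comparison runs over the whole column at once instead of calling a Python function per value, which tends to matter more as datasets grow; .apply() remains the more flexible choice when the transformation needs real logic.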