# Pandas
Pandas is an open-source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It is based on the data frame concept found in the R programming language.

The following code loads the miles per gallon (MPG) dataset into a data frame:

In [5]:
# Simple dataframe
import os
import pandas as pd

# Importing miles per gallon dataset
df = pd.read_csv("https://data.heatonresearch.com/data/t81-558/auto-mpg.csv")
print(df[0:5])

    mpg  cylinders  displacement horsepower  weight  acceleration  year  \
0  18.0          8         307.0        130    3504          12.0    70   
1  15.0          8         350.0        165    3693          11.5    70   
2  18.0          8         318.0        150    3436          11.0    70   
3  16.0          8         304.0        150    3433          12.0    70   
4  17.0          8         302.0        140    3449          10.5    70   

   origin                       name  
0       1  chevrolet chevelle malibu  
1       1          buick skylark 320  
2       1         plymouth satellite  
3       1              amc rebel sst  
4       1                ford torino  


The **display** function provides a cleaner display than merely printing the data frame. Specifying the maximum number of rows and columns gives you greater control over how much of the data frame is shown.

In [7]:
pd.set_option('display.max_columns', 7)
pd.set_option('display.max_rows', 5)
display(df)

      mpg  cylinders  displacement  ...  year  origin                       name
0    18.0          8         307.0  ...    70       1  chevrolet chevelle malibu
1    15.0          8         350.0  ...    70       1          buick skylark 320
..    ...        ...           ...  ...   ...     ...                        ...
396  28.0          4         120.0  ...    82       1                ford ranger
397  31.0          4         119.0  ...    82       1                 chevy s-10
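
Because `pd.set_option` changes the display limits for the rest of the session, it can be handy to apply them only temporarily. The following is a minimal sketch using `pd.option_context`; the `demo` frame is hypothetical stand-in data rather than the MPG file, and `repr` is used in place of **display** so the example also runs outside a notebook.

```python
import pandas as pd

# Hypothetical stand-in data: 20 rows by 10 columns, not the MPG file.
demo = pd.DataFrame({f"col{i}": list(range(20)) for i in range(10)})

rows_before = pd.get_option("display.max_rows")

# The limits apply only inside the with-block and are restored afterwards.
with pd.option_context("display.max_rows", 5, "display.max_columns", 7):
    compact_text = repr(demo)

rows_after = pd.get_option("display.max_rows")

print(compact_text)               # truncated view with "..." placeholders
print(rows_before == rows_after)  # the session-wide setting is unchanged
```

If the options were set globally instead, `pd.reset_option("display.max_rows")` returns a single option to its default.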


It is possible to generate a second data frame to display statistical information about the first data frame.

In [8]:
# Strip non-numerics
df = df.select_dtypes(include=['int', 'float'])

headers = list(df.columns.values)
fields = []

for field in headers:
    fields.append({
        'name' : field,
        'mean': df[field].mean(),
        'var': df[field].var(),
        'sdev': df[field].std()
    })

for field in fields:
    print(field)

{'name': 'mpg', 'mean': 23.514572864321615, 'var': 61.089610774274405, 'sdev': 7.815984312565782}
{'name': 'displacement', 'mean': 193.42587939698493, 'var': 10872.199152247364, 'sdev': 104.26983817119581}
{'name': 'acceleration', 'mean': 15.568090452261291, 'var': 7.604848233611381, 'sdev': 2.7576889298126757}


This code outputs a list of dictionaries that hold this statistical information. This information looks similar to the JSON code seen in Module 1. To produce proper JSON, the program should pass this list of records to the **dumps** function of Python's JSON library.
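
As a rough illustration of that conversion, the sketch below builds the same kind of record list and serializes it with `json.dumps`. The `demo` frame holds only the first five `mpg` and `displacement` values printed earlier, so the statistics differ from the full-dataset output; the explicit `float(...)` conversion is a precaution added for this example, not something the original code does.

```python
import json
import pandas as pd

# First five rows of two columns from the output shown earlier.
demo = pd.DataFrame({
    "mpg": [18.0, 15.0, 18.0, 16.0, 17.0],
    "displacement": [307.0, 350.0, 318.0, 304.0, 302.0],
})

# One record per column, mirroring the loop in the cell above.
fields = []
for field in demo.columns:
    fields.append({
        "name": field,
        "mean": float(demo[field].mean()),
        "var": float(demo[field].var()),
        "sdev": float(demo[field].std()),
    })

# Serialize the list of records to a JSON string.
json_text = json.dumps(fields, indent=2)
print(json_text)
```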

The Python program can convert this JSON-like information to a data frame for better display.

In [13]:
df2 = pd.DataFrame(fields)
display(df2)

           name        mean           var        sdev
0           mpg   23.514573     61.089611    7.815984
1  displacement  193.425879  10872.199152  104.269838
2  acceleration    15.56809      7.604848    2.757689
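
A similar summary table can be built without an explicit loop. The sketch below is one possible alternative, not the approach used above: it selects columns with `select_dtypes(include="number")`, which keeps integer as well as floating-point columns, and computes the statistics with `DataFrame.agg`. The `demo` frame contains only the first five rows printed earlier, so its numbers are illustrative.

```python
import pandas as pd

# First five rows of three columns from the output shown earlier.
demo = pd.DataFrame({
    "mpg": [18.0, 15.0, 18.0, 16.0, 17.0],
    "cylinders": [8, 8, 8, 8, 8],
    "name": [
        "chevrolet chevelle malibu", "buick skylark 320",
        "plymouth satellite", "amc rebel sst", "ford torino",
    ],
})

# Keep every numeric column (integers and floats); drops the text column.
numeric = demo.select_dtypes(include="number")

# agg returns one row per statistic; transpose to get one row per column.
summary = numeric.agg(["mean", "var", "std"]).T.rename(columns={"std": "sdev"})
summary.index.name = "name"
summary = summary.reset_index()

print(summary)
```

For a broader set of statistics (count, quartiles, minimum, maximum), `numeric.describe()` is a built-in shortcut.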


## Missing Values
Missing values are a reality of machine learning. Ideally, every row of data will have values for all columns; in practice, this is rarely the case. Most of the values are present in the MPG dataset, but the horsepower column has some missing entries. A common practice is to replace missing values with the median value for that column. The median is the middle value of the column when its entries are sorted, so it is less affected by outliers than the mean. The following code replaces any NA values in horsepower with the median:

In [14]:
import os
import pandas as pd

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/auto-mpg.csv", 
    na_values=['NA', '?'])
print(f"horsepower has na? {pd.isnull(df['horsepower']).values.any()}")
    
print("Filling missing values...")
med = df['horsepower'].median()
df['horsepower'] = df['horsepower'].fillna(med)
# df = df.dropna() # you can also simply drop NA values

print(f"horsepower has na? {pd.isnull(df['horsepower']).values.any()}")

horsepower has na? True
Filling missing values...
horsepower has na? False
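
To see the two options side by side (filling with the median versus dropping rows), here is a small self-contained sketch. The CSV text is made up for illustration, with `?` and `NA` standing in for missing horsepower values; only `read_csv`, `isna`, `median`, `fillna`, and `dropna` are used, and the `na_values` argument tells `read_csv` which strings to treat as missing.

```python
import io
import pandas as pd

# Made-up sample: two horsepower entries are marked as missing.
csv_text = """mpg,horsepower
18.0,130
15.0,?
18.0,150
16.0,NA
17.0,140
"""

demo = pd.read_csv(io.StringIO(csv_text), na_values=["NA", "?"])

# Count the missing entries before doing anything.
missing_before = int(demo["horsepower"].isna().sum())

# Option 1: fill missing entries with the column median (NaN is skipped).
med = demo["horsepower"].median()
filled = demo["horsepower"].fillna(med)

# Option 2: drop every row that has at least one missing value.
dropped = demo.dropna()

print(f"missing before: {missing_before}")
print(f"median used for filling: {med}")
print(f"rows after fillna: {len(filled)}, rows after dropna: {len(dropped)}")
```

Filling keeps all rows at the cost of introducing estimated values, while dropping keeps only observed values but shrinks the dataset.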
