# Pandas Primer - 2
- Reading files
    - CSV files
    - Excel files
    - URL
- Exporting data
- Exploring data
    - Summarizing data
    - Visualizing data

In [None]:
import pandas as pd

## 1. Reading files
- Pandas supports reading files from diverse sources
    - CSV file
    - Excel file
    - etc
- Data read from a file is stored as a DataFrame

### CSV files
- CSV files can be read using the ```read_csv()``` function

In [None]:
# reading csv file
# with default parameter settings, index is set [0, 1, ... , n-1] (when there are n rows)
# first row is used as header (column names)
df = pd.read_csv('glass.csv')
print(df.head())
print(df.columns)
print(df.index)
print(df.shape)

In [None]:
# reading csv file with parameter settings
# header is set to None => first row is treated as data, and columns get integer labels
# use first column as index (IDs from 1 to n)
df = pd.read_csv('glass.csv', header = None, index_col = 0)
print(df.head())
print(df.columns)
print(df.index)
print(df.shape)

In [None]:
# designating column names
# each element of the col_names list is used as a column name, in order
col_names = ['RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe', 'Type']
df = pd.read_csv('glass.csv', header = None, index_col = 0, names = col_names)
print(df.head())
print(df.columns)
print(df.index)
print(df.shape)
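
The effect of `header`, `index_col`, and `names` can also be checked without `glass.csv`. The sketch below is a minimal illustration that reads a small made-up CSV string through `io.StringIO`; the values and the column names `id`, `a`, `b` are hypothetical and are not taken from the glass data.

```python
import io
import pandas as pd

# hypothetical CSV text with no header row: an ID column followed by two value columns
csv_text = "1,0.5,x\n2,0.7,y\n3,0.9,z\n"

# default settings: the first row is consumed as the header, leaving 2 data rows
df_default = pd.read_csv(io.StringIO(csv_text))

# header=None keeps every row as data and labels the columns 0, 1, 2
df_no_header = pd.read_csv(io.StringIO(csv_text), header=None)

# names assigns column labels; index_col='id' uses that column as the index
df_named = pd.read_csv(io.StringIO(csv_text), header=None,
                       names=['id', 'a', 'b'], index_col='id')

print(df_default.shape)
print(df_no_header.shape)
print(df_named)
```

Comparing the two shapes shows why `header = None` matters for a file without a header row: with the default settings the first record is silently turned into column names.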

### Excel files
- Excel files (```.xlsx``` or ```.xls```) can be read using the ```read_excel()``` function

### CPU data
- Relative performance of different CPUs
- source: https://archive.ics.uci.edu/ml/machine-learning-databases/cpu-performance/
- Number of instances (# rows): 209
- Number of attributes (# cols): 10
    - vendor name: adviser, amdahl, apollo, basf, bti, burroughs, c.r.d, cambex, cdc, dec,
       dg, formation, four-phase, gould, honeywell, hp, ibm, ipl, magnuson,
       microdata, nas, ncr, nixdorf, perkin-elmer, prime, siemens, sperry,
       sratus, wang
    - model name: many unique symbols
    - MYCT: machine cycle time in nanoseconds (integer)
    - MMIN: minimum main memory in kilobytes (integer)
    - MMAX: maximum main memory in kilobytes (integer)
    - CACH: cache memory in kilobytes (integer)
    - CHMIN: minimum channels in units (integer)
    - CHMAX: maximum channels in units (integer)
    - PRP: published relative performance (integer)
    - ERP: estimated relative performance from the original article (integer)

In [None]:
# reading excel file with default parameter settings
# note that the first row is used as header by default
df = pd.read_excel('cpu.xlsx')
print(df.head())
print(df.columns)
print(df.index)
print(df.shape)

In [None]:
# if the workbook has several sheets, set sheet_name to choose one (the first sheet is read by default)
df = pd.read_excel('cpu.xlsx', sheet_name = 'Sheet1')
print(df.head())

### Reading data from URL
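- Read data directly from a URL using ```read_table()```, ```read_csv()```, or ```read_excel()```
    - Which function to use? *It depends on the file format*
    - ```read_csv()``` and ```read_table()``` read delimited text (default separator: comma vs. tab), while ```read_excel()``` reads Excel workbooks
    - If the format is unclear, try each of them and keep the one that parses the data correctly
- Valid URL schemes include ```http```, ```ftp```, ```s3```, and ```file```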

In [None]:
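# using read_table() to read data from url
# machine.data is comma-separated and has no header row, so sep and header are set explicitly
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/cpu-performance/machine.data'
df = pd.read_table(url, sep = ',', header = None)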
print(df.head())

In [None]:
# using read_csv() to read data from url
url="https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
df = pd.read_csv(url)
print(df.head())
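
The `file` scheme can be tried without a network connection. The following is a minimal sketch, assuming a small made-up CSV file written to a temporary directory (the file name `url_example.csv` and its contents are hypothetical); `Path.as_uri()` from the standard library builds the `file://` URL.

```python
import tempfile
from pathlib import Path

import pandas as pd

# write a small made-up CSV file to a temporary directory
tmp_dir = Path(tempfile.mkdtemp())
csv_path = tmp_dir / 'url_example.csv'
csv_path.write_text('name,value\na,1\nb,2\n')

# build a file:// URL for the path and read it like any other URL
file_url = csv_path.resolve().as_uri()
df_url = pd.read_csv(file_url)

print(file_url)
print(df_url)
```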

## 2. Exporting data
- Pandas can also export a DataFrame to a file, using ```to_csv()``` or ```to_excel()```

In [None]:
l1 = [1., 2., 3., 4., 5.]
l2 = [1, 2, 3, 4, 5]
l3 = ['a', 'b', 'c', 'd', 'e']
l4 = ['A', 'B', 'C', 'D', 'E']
df = pd.DataFrame({'float': l1, 'int': l2, 'lower': l3, 'upper': l4})
print(df)

In [None]:
# saving CSV file
df.to_csv('csv_example.csv')

In [None]:
# saving excel file
df.to_excel('excel_example.xlsx')
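
One detail worth checking when exporting: by default `to_csv()` also writes the index as the first column, so reading the file back with default settings produces an extra `Unnamed: 0` column. The sketch below illustrates this with a small made-up DataFrame and temporary files (the names `df_small`, `with_index.csv`, and `no_index.csv` are hypothetical).

```python
import tempfile
from pathlib import Path

import pandas as pd

df_small = pd.DataFrame({'int': [1, 2, 3], 'lower': ['a', 'b', 'c']})
out_dir = Path(tempfile.mkdtemp())

# default: the index is written as an unnamed first column
df_small.to_csv(out_dir / 'with_index.csv')
back_with_index = pd.read_csv(out_dir / 'with_index.csv')

# index=False: only the data columns are written
df_small.to_csv(out_dir / 'no_index.csv', index=False)
back_no_index = pd.read_csv(out_dir / 'no_index.csv')

print(back_with_index.columns.tolist())
print(back_no_index.columns.tolist())
```

An alternative to `index=False` is to keep the index in the file and pass `index_col = 0` when reading it back.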

## 3. Exploring data
- Pandas provides convenient ways to explore data

### Summarizing data

In [None]:
# we keep our example with CPU data
df = pd.read_excel('cpu.xlsx')
print(df.head())       # first 5 rows
print(df.tail())       # last 5 rows

In [None]:
# summarizing whole DataFrame
# by default, only columns with numerical values are described
print(df.describe())

In [None]:
# describing only one column at a time
print(df['MYCT'].describe())
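
Non-numerical columns can be included in the summary as well. The sketch below is a minimal illustration on a made-up DataFrame (`df_mixed` and its values are hypothetical, not the CPU data), comparing the default `describe()` with `describe(include='all')`.

```python
import pandas as pd

df_mixed = pd.DataFrame({'float': [1., 2., 3., 4., 5.],
                         'int': [1, 2, 3, 4, 5],
                         'lower': ['a', 'b', 'c', 'd', 'e']})

# default: only the numerical columns appear in the summary
numeric_summary = df_mixed.describe()

# include='all': non-numerical columns are summarized too (count, unique, top, freq)
full_summary = df_mixed.describe(include='all')

print(numeric_summary.columns.tolist())
print(full_summary.columns.tolist())
print(numeric_summary.loc['mean', 'int'])
```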

### Visualizing data

In [None]:
# plotting dataset
import matplotlib.pyplot as plt
df.plot()
plt.show()

In [None]:
# plotting variable (column)
df['MYCT'].plot()
plt.show()

In [None]:
# plotting multiple columns at a time
df[['MYCT', 'CACH']].plot()
plt.show()

In [None]:
# what about categorical variables?
counts = df['vendor_name'].value_counts()
counts.plot(kind = 'bar')
plt.show()
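
What `value_counts()` returns can be inspected before plotting. The sketch below uses a short made-up Series of vendor names (hypothetical sample values, not the actual CPU data) to show that the result is a Series of counts sorted in descending order, from which the most frequent value can be read with `idxmax()`.

```python
import pandas as pd

# hypothetical sample of a categorical column
vendors = pd.Series(['ibm', 'hp', 'ibm', 'dec', 'ibm', 'hp'])

# counts per distinct value, most frequent first
counts = vendors.value_counts()

print(counts)
print(counts.idxmax())   # label with the highest count
print(counts.iloc[0])    # the highest count itself
```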

### Exercise 2-1.
- Import the dataset at the URL below as a DataFrame, using three methods
    - Import directly via the URL
    - Save the dataset as a CSV file, then import it
    - Save the dataset as an Excel file, then import it
- URL: https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data

In [None]:
## Your answer

### Exercise 2-2.
- Describe the first three variables (columns) of the dataset imported above
- Plot those three variables
- Count the occurrences of each value in the last column (car name), and identify which car name occurs most frequently

In [None]:
## Your answer

### Exercise 2-3.
- Convert the above DataFrame into a NumPy array (excluding the last column)
- Compute the average and standard deviation of each column
     - use the ```mean()``` and ```std()``` functions
- Perform the same computation on the original DataFrame
    - Are the results identical?

In [None]:
## Your answer