## Pandas

Pandas is a Python package for handling tabular data (CSV and Excel files, and SQL databases).
- a ton of methods and functions to manipulate data (we only cover a very small fraction on days 1-2 of week 4)
- you can visualize your data with pandas (days 3-4 of week 4)
- excellent documentation
- tons of discussion on stackoverflow.com


In [1]:
import pandas as pd

### Some notes and advice

- **ALWAYS READ THE HELP OF THE METHODS/FUNCTIONS YOU USE!**

- stackoverflow is your friend, use it! https://stackoverflow.com/


### By the end of the day you'll be able to
- write a csv file from a pandas dataframe
- read in csv and excel files, and sql databases
- filter rows in your data frame,
- filter columns.

### <font color='LIGHTGRAY'>By the end of the day you'll be able to</font>
- **write a csv file from a pandas dataframe**
- <font color='LIGHTGRAY'>read in csv and excel files, and sql databases</font>
- <font color='LIGHTGRAY'>filter rows in your data frame,</font>
- <font color='LIGHTGRAY'>filter columns.</font>


In [2]:
names_list = ['Ashley', 'Andras', 'Rihanna', 'Emily']
ages_list = [30, 36, 28, 33]
birthplaces_list = ['USA', 'Hungary', 'Barbados', 'USA']
singers_list = [False, False, True, False]
people_dict = {
    "name": names_list,
    "age": ages_list,
    "birthplace": birthplaces_list,
    "is_singer": singers_list
}
people_df = pd.DataFrame(people_dict)
people_df

      name  age  birthplace  is_singer
0   Ashley   30         USA      False
1   Andras   36     Hungary      False
2  Rihanna   28    Barbados       True
3    Emily   33         USA      False


In [3]:
people_df.to_csv('data/people.csv')
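
By default, `to_csv` also writes the row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read in again. A minimal sketch of the two options follows. It uses an in-memory buffer instead of a file path, and the small `demo_df` is made up for illustration.

```python
import io
import pandas as pd

demo_df = pd.DataFrame({"name": ["Ashley", "Andras"], "age": [30, 36]})

# default: the index is written as an unnamed first column
with_index = io.StringIO()
demo_df.to_csv(with_index)
with_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()
print(columns_with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
demo_df.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = pd.read_csv(without_index).columns.tolist()
print(columns_without_index)
```

If the index carries no information (as with the default range index), `index=False` keeps the file cleaner.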

### <font color='LIGHTGRAY'>By the end of the day you'll be able to</font>
- <font color='LIGHTGRAY'>write a csv file from a pandas dataframe</font>
- **read in csv and excel files, and sql databases**
- <font color='LIGHTGRAY'>filter rows in your data frame,</font>
- <font color='LIGHTGRAY'>filter columns.</font>


In [4]:
# how to read data into a dataframe, and the basic dataframe structure
# load data from a csv file
df = pd.read_csv('data/adult_data.csv') # there are also pd.read_excel() and pd.read_sql()

print(df)
#print(df.head()) # by default, shows the first five rows but check help(df.head) to specify the number of rows to show
#print(df.shape) # the shape of your dataframe (number of rows, number of columns)
#print(df.shape[0]) # number of rows
#print(df.shape[1]) # number of columns

       age          workclass  fnlwgt    education  education-num  \
0       39          State-gov   77516    Bachelors             13   
1       50   Self-emp-not-inc   83311    Bachelors             13   
2       38            Private  215646      HS-grad              9   
3       53            Private  234721         11th              7   
4       28            Private  338409    Bachelors             13   
...    ...                ...     ...          ...            ...   
32556   27            Private  257302   Assoc-acdm             12   
32557   40            Private  154374      HS-grad              9   
32558   58            Private  151910      HS-grad              9   
32559   22            Private  201490      HS-grad              9   
32560   52       Self-emp-inc  287927      HS-grad              9   

            marital-status          occupation    relationship    race  \
0            Never-married        Adm-clerical   Not-in-family   White   
1       Married-civ-spo

### DataFrame structure: both rows and columns are indexed!
- index column, no name
    - contains the row names
    - by default, the index is a RangeIndex from 0 to number of rows - 1
    - any column can be turned into an index, so indices can be non-numeric and also non-unique. More on this later.
- columns with column names on top

### Always print your dataframe to check if it looks ok!

### Most common reasons it might not look ok:

- the first row is not the column name
    - there are rows above the column names that need to be skipped
    - there are no column names in the file, but by default pandas assumes the first row holds them. As a result,
      the values of the first row end up as column names.
- character encoding is off
- separator is not a comma but some other character
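
A rough sketch of how these cases map onto `read_csv` arguments, using made-up text in an in-memory buffer rather than a real file (the strings and column names below are illustrative only). A wrong character encoding is handled the same way through the `encoding` argument, which is not shown here.

```python
import io
import pandas as pd

# two junk lines above the header, and ';' as the separator
raw_text = "exported by some tool\nsecond junk line\nname;age\nAshley;30\nAndras;36\n"
fixed = pd.read_csv(io.StringIO(raw_text), skiprows=2, sep=";")
print(fixed)

# no header row at all: pass header=None and supply the names yourself
no_header_text = "Ashley,30\nAndras,36\n"
named = pd.read_csv(io.StringIO(no_header_text), header=None, names=["name", "age"])
print(named)
```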

In [5]:
# check the help to find the solution
help(pd.read_csv)

Help on function read_csv in module pandas.io.parsers:

read_csv(filepath_or_buffer: Union[ForwardRef('PathLike[str]'), str, IO[~T], io.RawIOBase, io.BufferedIOBase, io.TextIOBase, _io.TextIOWrapper, mmap.mmap], sep=<object object at 0x7f95bc30b2c0>, delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal: str = '.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, dialect=None, error_bad_lines=True, warn_bad_lines=True, delim_whit

### Exercise 1

How should we read in adult_test.csv properly? This file is in 'data/adult_test.csv'. Identify and fix the problem.

In [6]:
# add your code below



### <font color='LIGHTGRAY'>By the end of the day you'll be able to</font>
- <font color='LIGHTGRAY'>write a csv file from a pandas dataframe</font>
- <font color='LIGHTGRAY'>read in csv and excel files, and sql databases</font>
- **filter rows in your data frame,**
- <font color='LIGHTGRAY'>filter columns.</font>


### How to select rows?

##### 1) Integer-based indexing, numpy arrays are indexed the same way.
##### 2) Select rows based on the value of the index column
##### 3) Select rows based on column condition

### 1) Integer-based indexing, numpy arrays are indexed the same way.


In [7]:
# df.iloc[] - for more info, see https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-integer
# iloc indexes by position the way numpy arrays do (this goes beyond what python lists support)

# [start:stop:step] -  general indexing format

# start stop step are optional
print(df.iloc[:])
#print(df.iloc[::])
#print(df.iloc[::1])

# select one row - 0-based indexing
#print(df.iloc[3])

# indexing from the end of the data frame
#print(df.iloc[-2]) 

       age          workclass  fnlwgt    education  education-num  \
0       39          State-gov   77516    Bachelors             13   
1       50   Self-emp-not-inc   83311    Bachelors             13   
2       38            Private  215646      HS-grad              9   
3       53            Private  234721         11th              7   
4       28            Private  338409    Bachelors             13   
...    ...                ...     ...          ...            ...   
32556   27            Private  257302   Assoc-acdm             12   
32557   40            Private  154374      HS-grad              9   
32558   58            Private  151910      HS-grad              9   
32559   22            Private  201490      HS-grad              9   
32560   52       Self-emp-inc  287927      HS-grad              9   

            marital-status          occupation    relationship    race  \
0            Never-married        Adm-clerical   Not-in-family   White   
1       Married-civ-spo

In [8]:
# select a slice - stop index not included
print(df.iloc[3:7])

# select every second element of the slice - stop index not included
#print(df.iloc[3:7:2])

#print(df.iloc[3:7:-2]) # returns an empty dataframe
#print(df.iloc[7:3:-2]) # returns rows with indices 7 and 5; 3 is the stop so it is not included

# can be used to reverse rows
#print(df.iloc[::-1])

# here is where indexing gets non-standard python
# select the 2nd, 5th, and 10th rows
#print(df.iloc[[1,4,9]]) # such indexing doesn't work with lists but it works with numpy arrays

   age workclass  fnlwgt   education  education-num          marital-status  \
3   53   Private  234721        11th              7      Married-civ-spouse   
4   28   Private  338409   Bachelors             13      Married-civ-spouse   
5   37   Private  284582     Masters             14      Married-civ-spouse   
6   49   Private  160187         9th              5   Married-spouse-absent   

           occupation    relationship    race      sex  capital-gain  \
3   Handlers-cleaners         Husband   Black     Male             0   
4      Prof-specialty            Wife   Black   Female             0   
5     Exec-managerial            Wife   White   Female             0   
6       Other-service   Not-in-family   Black   Female             0   

   capital-loss  hours-per-week  native-country gross-income  
3             0              40   United-States        <=50K  
4             0              40            Cuba        <=50K  
5             0              40   United-States       
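
The same `iloc` patterns are easier to follow on a tiny dataframe where the expected rows are obvious. This is only a sketch; `demo_df` and its `value` column are made up for the example.

```python
import pandas as pd

demo_df = pd.DataFrame({"value": [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]})

middle = demo_df.iloc[3:7]        # positions 3, 4, 5, 6; the stop is excluded
backwards = demo_df.iloc[7:3:-2]  # positions 7 and 5
picked = demo_df.iloc[[1, 4, 9]]  # a list of positions, as with numpy arrays
last_but_one = demo_df.iloc[-2]   # a single row comes back as a Series

print(middle["value"].tolist())
print(backwards["value"].tolist())
print(picked["value"].tolist())
print(last_but_one["value"])
```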


### 2) Select rows based on the value of the index column

In [9]:
# df.loc[] - for more info, see https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-label

print(df.index) # the default index when reading in a file is a range index. In this case,
                 # .loc and .iloc work ALMOST the same.
# one difference:
#print(df.loc[3:9:2]) # this selects the 4th, 6th, 8th, 10th rows - the stop element is included!

help(df.set_index)

RangeIndex(start=0, stop=32561, step=1)
Help on method set_index in module pandas.core.frame:

set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False) method of pandas.core.frame.DataFrame instance
    Set the DataFrame index using existing columns.
    
    Set the DataFrame index (row labels) using one or more existing
    columns or arrays (of the correct length). The index can replace the
    existing index or expand on it.
    
    Parameters
    ----------
    keys : label or array-like or list of labels/arrays
        This parameter can be either a single column key, a single array of
        the same length as the calling DataFrame, or a list containing an
        arbitrary combination of column keys and arrays. Here, "array"
        encompasses :class:`Series`, :class:`Index`, ``np.ndarray``, and
        instances of :class:`~collections.abc.Iterator`.
    drop : bool, default True
        Delete columns to be used as the new index.
    append : bool, d
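
The difference between `.iloc` and `.loc` slicing on a default range index can be checked side by side. A small sketch, with a made-up `demo_df`:

```python
import pandas as pd

demo_df = pd.DataFrame({"value": list("abcdefghij")})  # default RangeIndex 0..9

by_position = demo_df.iloc[3:9:2]  # positions 3, 5, 7; the stop is excluded
by_label = demo_df.loc[3:9:2]      # labels 3, 5, 7, 9; the stop is included

print(by_position.index.tolist())
print(by_label.index.tolist())
```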

In [10]:
df_index_age = df.set_index('age',drop=False)

#print(df_index_age.index)
#print(df_index_age.head())

#print(df_index_age.loc[30].head()) # collect everyone with age 30 - the index is non-unique

#print(df_index_age.loc[30:35]) # raises a KeyError: this index is non-unique and unsorted, so it cannot be sliced by label.
                               # this does not return everyone between the ages of 30 and 35
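
One possible way around the slicing problem, shown on made-up data, is to sort the index first: label slices work on a sorted index even when it is non-unique. This is a sketch of that idea, not the only option (a column condition, covered next, works too).

```python
import pandas as pd

demo_df = pd.DataFrame({"age": [30, 36, 28, 30, 33], "name": ["A", "B", "C", "D", "E"]})
by_age = demo_df.set_index("age", drop=False)

print(by_age.loc[30])  # both rows labelled 30

# on the unsorted, non-unique index the label slice is expected to fail
try:
    by_age.loc[30:35]
except KeyError as err:
    print("label slice failed:", err)

# after sorting, the label slice works and both ends are inclusive
sorted_by_age = by_age.sort_index()
in_range = sorted_by_age.loc[30:35]
print(in_range["age"].tolist())
```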

### 3) Select rows based on column condition

In [11]:
# one condition
print(df[df['age']==30].head())
# here is the condition: it's a boolean series (a series is basically a dataframe with one column)
#print(df['age']==30)

# multiple conditions can be combined with & (and) and | (or); wrap each condition in parentheses
#print(df[(df['age']>30)&(df['age']<35)].head())
#print(df[(df['age']==90)|(df['native-country']==' Hungary')])

    age     workclass  fnlwgt      education  education-num  \
11   30     State-gov  141297      Bachelors             13   
33   30   Federal-gov   59951   Some-college             10   
59   30       Private  188146        HS-grad              9   
60   30       Private   59496      Bachelors             13   
88   30       Private   54334            9th              5   

         marital-status          occupation    relationship  \
11   Married-civ-spouse      Prof-specialty         Husband   
33   Married-civ-spouse        Adm-clerical       Own-child   
59   Married-civ-spouse   Machine-op-inspct         Husband   
60   Married-civ-spouse               Sales         Husband   
88        Never-married               Sales   Not-in-family   

                   race    sex  capital-gain  capital-loss  hours-per-week  \
11   Asian-Pac-Islander   Male             0             0              40   
33                White   Male             0             0              40   
59      
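
A small sketch of how a combined condition is built up, using made-up data (`demo_df` is illustrative). Each comparison gives a boolean series, and `&` or `|` combines them element by element. The parentheses matter because `&` and `|` bind more tightly than the comparison operators.

```python
import pandas as pd

demo_df = pd.DataFrame({
    "age": [30, 36, 28, 33, 90],
    "birthplace": ["USA", "Hungary", "Barbados", "USA", "USA"],
})

# the mask is a boolean series with one entry per row
mask = (demo_df["age"] > 30) & (demo_df["age"] < 35)
print(mask.tolist())
print(demo_df[mask])

# 'or' keeps a row if at least one condition holds
either = demo_df[(demo_df["age"] == 90) | (demo_df["birthplace"] == "Hungary")]
print(len(either))
```

String comparisons are exact, which is why the cell above compares against `' Hungary'` with a leading space.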

### Exercise 2
How many people in adult_data.csv work at least 60 hours a week and have a doctorate?

### <font color='LIGHTGRAY'>By the end of the day you'll be able to</font>
- <font color='LIGHTGRAY'>write a csv file from a pandas dataframe</font>
- <font color='LIGHTGRAY'>read in csv and excel files, and sql databases</font>
- <font color='LIGHTGRAY'>filter rows in your data frame,</font>
- **filter columns.**

In [12]:
columns = df.columns
print(columns)

# select columns by column name
#print(df[['age','hours-per-week']])
#print(columns[[1,5,7]])
#print(df[columns[[1,5,7]]])

# select columns by index using iloc
#print(df.iloc[:,3])

# select columns by a list of positions (works like numpy arrays, not like python lists)
#print(df.iloc[:,[3,5,6]])

# select every second column with a slice (standard python slicing)
#print(df.iloc[:,::2])


Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'gross-income'],
      dtype='object')
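
The column selections above can be tried on a small made-up dataframe. The sketch also shows `.loc` with a row condition and a list of column names, which filters rows and columns in one step; `demo_df` and the variable names are illustrative.

```python
import pandas as pd

demo_df = pd.DataFrame({
    "name": ["Ashley", "Andras"],
    "age": [30, 36],
    "birthplace": ["USA", "Hungary"],
    "is_singer": [False, False],
})

one_column = demo_df["age"]            # single name: returns a Series
sub_frame = demo_df[["name", "age"]]   # list of names: returns a DataFrame
by_position = demo_df.iloc[:, [0, 2]]  # columns at positions 0 and 2
every_other = demo_df.iloc[:, ::2]     # every second column

# rows where age > 30, restricted to two columns
older = demo_df.loc[demo_df["age"] > 30, ["name", "birthplace"]]
print(older)
```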
