

<h1>Pandas</h1>
pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. 


Import the pandas library and other related libraries.

In [None]:
# Import specific names from pandas so they can be used without a prefix
from pandas import DataFrame, read_csv
# General syntax to import a whole library under an alias:
#   import (library) as (alias)
import pandas as pd  # this is how I usually import pandas
import sys  # only needed to determine the Python version number

<h1>Learning objective</h1> 

We will use an example from a real research project and use **Python pandas** to walk through the steps that social scientists usually take to handle raw data.


## Data example

We are going to use an example data set retrieved from Stack Overflow.

Stack Overflow is a question and answer website. Users can ask and answer questions related to programming. Here is an example question and all the answers:

https://stackoverflow.com/questions/4/how-to-convert-a-decimal-to-a-double-in-c

Stack Overflow data is publicly available, which allows researchers to observe user behavior in the community.

Here is an overview of the data schema:

https://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede


<h1>Import csv data</h1>


To pull in the csv file, we will use the pandas function *read_csv*. Let us take a look at this function and what inputs it takes.

In [None]:
read_csv?

Common parameters in **read_csv**

read_csv(filepath_or_buffer, sep= , header= , names= , usecols= )

- filepath_or_buffer : str, path object or file-like object
   
- sep : str, default ','
    
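- header : int, list of int, default 'infer'
 - Row number(s) to use as the column names, and the start of the data. Pass `header=None` if the file has no header row.
    
- names : array-like, optional
 - List of column names to use. If the file already contains a header row, pass `header=0` as well so that the names replace it.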
   
- usecols : list-like or callable
 - Return a subset of the columns. 
 - For example, a valid list-like
    `usecols` parameter would be ``[0, 1, 2]`` or ``['foo', 'bar', 'baz']``.
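
As a minimal sketch of how these parameters work together, the snippet below parses a small in-memory table. The sample text, the semicolon separator, and the column names are made up for illustration; this is not the tutorial's `so_question.csv`.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated data with no header row.
raw = "8441;2008-08-12 03:58:21;.htaccess\n3157;2008-08-06 08:15:28;.htaccess\n"

sample = pd.read_csv(
    io.StringIO(raw),
    sep=";",                               # field delimiter
    header=None,                           # the data has no header row
    names=["id", "creation_time", "tag"],  # column names to assign
    usecols=["id", "tag"],                 # keep only these columns
)
print(sample)
```

Here `usecols` refers to the names supplied through `names`, so only the `id` and `tag` columns end up in the result.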
   


In [None]:
from google.colab import files
uploaded = files.upload()

Saving so_question.csv to so_question.csv


In [None]:
#import in colab environment
import io
df = pd.read_csv(io.BytesIO(uploaded['so_question.csv']))

In [None]:
#import from local drive
Location = r'G:\My Drive\CCSS\so_question.csv'
df = pd.read_csv(Location, header=None)

In [None]:
df

Unnamed: 0,8441,2008-08-12 03:58:21,.htaccess,2008,8,8454,57
0,3157,2008-08-06 08:15:28,.htaccess,2008,8,4449.0,476.0
1,14061,2008-08-18 00:49:33,.net,2008,8,14337.0,615.0
2,22623,2008-08-22 15:12:15,.net,2008,8,22628.0,357.0
3,29988,2008-08-27 12:40:40,.net,2008,8,30001.0,3191.0
4,9376,2008-08-13 01:05:47,.net,2008,8,,1083.0
...,...,...,...,...,...,...,...
9994,30211,2008-08-27 14:06:52,zip,2008,8,30473.0,1414.0
9995,17250,2008-08-20 00:16:40,zip,2008,8,,394.0
9996,12823,2008-08-15 22:09:05,zipcode,2008,8,12875.0,350.0
9997,25,2008-08-01 12:13:50,zos,2008,8,,23.0


This brings us to the first problem of the exercise. The **read_csv** function treated the first record in the csv file as the header names. This is obviously not correct since the text file did not provide us with header names.

To correct this, we pass the **header** parameter to the read_csv function and set it to **None** (Python's null value).

In [None]:
df = pd.read_csv(io.BytesIO(uploaded['so_question.csv']), header=None)
df

Unnamed: 0,0,1,2,3,4,5,6
0,8441,2008-08-12 03:58:21,.htaccess,2008,8,8454.0,57.0
1,3157,2008-08-06 08:15:28,.htaccess,2008,8,4449.0,476.0
2,14061,2008-08-18 00:49:33,.net,2008,8,14337.0,615.0
3,22623,2008-08-22 15:12:15,.net,2008,8,22628.0,357.0
4,29988,2008-08-27 12:40:40,.net,2008,8,30001.0,3191.0
...,...,...,...,...,...,...,...
9995,30211,2008-08-27 14:06:52,zip,2008,8,30473.0,1414.0
9996,17250,2008-08-20 00:16:40,zip,2008,8,,394.0
9997,12823,2008-08-15 22:09:05,zipcode,2008,8,12875.0,350.0
9998,25,2008-08-01 12:13:50,zos,2008,8,,23.0


To give the columns specific names, we pass another parameter called **names**. When **names** is given, we can omit the **header** parameter, because read_csv then behaves as if header=None.

In [None]:
df = pd.read_csv(io.BytesIO(uploaded['so_question.csv']), names=['id', 'creation_time', 'tag', 'year','month', 'accepted_id','owner_user_id'])
df

Unnamed: 0,id,creation_time,tag,year,month,accepted_id,owner_user_id
0,8441,2008-08-12 03:58:21,.htaccess,2008,8,8454.0,57.0
1,3157,2008-08-06 08:15:28,.htaccess,2008,8,4449.0,476.0
2,14061,2008-08-18 00:49:33,.net,2008,8,14337.0,615.0
3,22623,2008-08-22 15:12:15,.net,2008,8,22628.0,357.0
4,29988,2008-08-27 12:40:40,.net,2008,8,30001.0,3191.0
...,...,...,...,...,...,...,...
9995,30211,2008-08-27 14:06:52,zip,2008,8,30473.0,1414.0
9996,17250,2008-08-20 00:16:40,zip,2008,8,,394.0
9997,12823,2008-08-15 22:09:05,zipcode,2008,8,12875.0,350.0
9998,25,2008-08-01 12:13:50,zos,2008,8,,23.0


You can think of the numbers [0,1,2,3,4] as the row numbers in an Excel file. In pandas, these are part of the **index** of the dataframe. You can think of the index as the primary key of a SQL table, with the exception that an index is allowed to contain duplicates.
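
To illustrate the point about duplicates, here is a small sketch with a toy dataframe (made-up values, not the Stack Overflow data) whose index repeats a label on purpose.

```python
import pandas as pd

# Toy frame with a deliberately repeated index label (hypothetical data).
toy = pd.DataFrame({"tag": ["c", "c#", "c++"]}, index=[0, 1, 1])

print(toy.index.is_unique)  # False: label 1 appears twice
print(toy.loc[1])           # selecting label 1 returns both matching rows

# reset_index(drop=True) replaces the index with a unique 0..n-1 range.
toy_reset = toy.reset_index(drop=True)
print(toy_reset.index.is_unique)
```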

## Sanity check
We usually start with a few sanity checks to make sure the imported data is in good shape.

We check that the data type of each variable is sensible, and we check the total number of observations and the number of variables in the data set.

Let's take a look at the example data set.


In [None]:
# Check the data type of each column
# 'object' usually means the column holds text (Python strings)
df.dtypes

id                 int64
creation_time     object
tag               object
year               int64
month              int64
accepted_id      float64
owner_user_id    float64
dtype: object

In [None]:
# Check data type of a specific column
df.accepted_id.dtype

dtype('float64')
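
The `float64` type of an id column is most likely a consequence of missing values: a NumPy integer column cannot hold NaN, so pandas falls back to floats. The sketch below uses made-up ids to show this, and uses the nullable `Int64` dtype as one possible way to keep integers alongside missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical accepted-answer ids, one of them missing.
ids = pd.Series([8454, np.nan, 14337])
print(ids.dtype)  # float64, because of the NaN

# The nullable integer dtype stores the gap as <NA> and keeps whole numbers.
ids_nullable = ids.astype("Int64")
print(ids_nullable)
```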

Check the number of observations (rows)

In [None]:
len(df.index)

10000

Check the number of variables (columns)

In [None]:
len(df.columns)

7
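
Both counts are also available in one step through the `shape` attribute. A minimal sketch on a toy dataframe with made-up rows:

```python
import pandas as pd

# Made-up rows standing in for the real data set.
toy = pd.DataFrame({"id": [9, 39, 14], "tag": [".net", ".net", "c#"]})

n_rows, n_cols = toy.shape  # (number of rows, number of columns)
print(n_rows, n_cols)

toy.info()  # prints dtypes and non-null counts per column
```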

## Slicing and Indexing
When a data set is large, we often want to view only certain rows or columns. Slicing and indexing let us extract that part of the data set.

In [None]:
df[['id', 'creation_time']]

Unnamed: 0,id,creation_time
0,9,2008-07-31 23:40:59
1,39,2008-08-01 12:43:11
2,14,2008-08-01 00:59:11
3,104,2008-08-01 15:12:34
4,59,2008-08-01 13:14:33
...,...,...
95,88,2008-08-01 14:36:18
96,16,2008-08-01 04:59:33
97,88,2008-08-01 14:36:18
98,108,2008-08-01 15:22:29


Query for the 5th row (position 4, because counting starts at 0)

In [None]:
df[4:5]

Unnamed: 0,id,creation_time,tag,year,month,accepted_id,owner_user_id
4,59,2008-08-01 13:14:33,.net-3.5,2008,8,,45.0


Get the question id in the 5th row

In [None]:
df[['id']][4:5]

Unnamed: 0,id
4,59
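
The same selections can also be written with the `iloc` (position-based) and `loc` (label-based) indexers. This is a sketch on a toy dataframe, not a replacement for the cells above.

```python
import pandas as pd

# Toy frame with the default 0..n-1 index, so labels and positions coincide.
toy = pd.DataFrame(
    {"id": [9, 39, 14, 104, 59], "tag": [".net", ".net", ".net", ".net", ".net-3.5"]}
)

fifth_row = toy.iloc[4]      # position-based: the 5th row, returned as a Series
fifth_id = toy.loc[4, "id"]  # label-based: row label 4, column 'id'
first_two = toy.iloc[0:2]    # positional slice, end position excluded

print(fifth_row)
print(fifth_id)
print(first_two)
```

Note that `toy.iloc[4]` returns a Series, whereas the slice `toy[4:5]` returns a one-row dataframe.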


## Exercise 1

What is the tag of the 50th question in the data set?

In [None]:
df[['tag']][49:50]

Unnamed: 0,tag
49,html


## Time span of the data set

When we have time series data, we usually want to know the time span of the data set. There are multiple ways to find the earliest time stamp (the latest one can be found the same way, using descending order or max()):

- Sort the dataframe and select the top row
- Use the min() method to find the minimum value

In [None]:
# Method 1: sort by creation time and look at the first row
sorted_df = df.sort_values(['creation_time'], ascending=True)
sorted_df.head(1)

Unnamed: 0,id,creation_time,tag,year,month,accepted_id,owner_user_id
90,4,2008-07-31 21:42:52,type-conversion,2008,7,7.0,8.0


In [None]:
# Method 2:
df['creation_time'].min()

'2008-07-31 21:42:52'
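
The result above is a string, because `creation_time` has dtype `object`. Taking the minimum still works, since time stamps in this year-month-day format sort correctly as text. To compute the length of the time span, the column can be converted to datetimes first. A minimal sketch on made-up time stamps:

```python
import pandas as pd

# Made-up creation times stored as strings, as in the example data.
toy = pd.DataFrame(
    {"creation_time": ["2008-08-01 12:43:11", "2008-07-31 21:42:52", "2008-08-01 15:22:29"]}
)

times = pd.to_datetime(toy["creation_time"])  # object (string) -> datetime64
start, end = times.min(), times.max()
span = end - start  # a pandas Timedelta

print(start, end, span)
```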

Sometimes we want to know which values a certain variable can take, such as state names or country names. We can use **unique** to get the set of values of a given variable.

In this example data, each question can have multiple tags, and we are interested in the set of tags that appear in the data.

In [None]:
df['tag'].unique()

array(['.net', '.net-3.5', '64-bit', 'actionscript-3', 'air', 'algorithm',
       'apache-flex', 'aptana', 'architecture', 'arrays', 'binary-data',
       'branch', 'branching-and-merging', 'browser', 'c', 'c#', 'c++',
       'com-interop', 'css', 'data-storage', 'database', 'datatable',
       'datediff', 'datetime', 'decimal', 'double', 'eclipse',
       'file-type', 'flat-file', 'floating-point', 'form-submit', 'forms',
       'hook', 'html', 'internet-explorer-7', 'language-agnostic', 'linq',
       'linux', 'mainframe', 'math', 'memory-leaks', 'mime', 'mysql',
       'office-2007', 'performance', 'php', 'pi', 'plugins', 'rdbms',
       'rdoc', 'relative-time-span', 'ruby', 'sockets', 'sql',
       'sql-server', 'subclipse', 'submit-button', 'svn', 'time', 'timer',
       'timezone', 'timezone-offset', 'tortoisesvn', 'triggers',
       'type-conversion', 'unix', 'user-agent', 'vb.net', 'visual-c++',
       'web-services', 'winapi', 'windows', 'zos'], dtype=object)
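
Two related methods are often useful next to `unique`: `nunique` counts the distinct values, and `value_counts` counts the rows per value. A short sketch on made-up tags:

```python
import pandas as pd

# Made-up tags; 'c#' is repeated on purpose.
tags = pd.Series(["c#", ".net", "c#", "sql", "c#"], name="tag")

print(tags.unique())        # distinct values, in order of first appearance
print(tags.nunique())       # number of distinct values
print(tags.value_counts())  # rows per value, most frequent first
```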

## Exercise 2

- Which years and months of questions are in the data?


In [None]:
print(df['year'].unique())
print(df['month'].unique())

[2008]
[8 7]


## Query using conditions based on column values

Sometimes we are interested in certain data points in the data, for example:

- questions from August 2008
- questions tagged as 'python'

We can select them by passing a boolean condition to the **loc** indexer.

In [None]:
df.loc[(df['year'] == 2008) & (df['month'] == 8)]

Unnamed: 0,id,creation_time,tag,year,month,accepted_id,owner_user_id
0,8441,2008-08-12 03:58:21,.htaccess,2008,8,8454.0,57.0
1,3157,2008-08-06 08:15:28,.htaccess,2008,8,4449.0,476.0
2,14061,2008-08-18 00:49:33,.net,2008,8,14337.0,615.0
3,22623,2008-08-22 15:12:15,.net,2008,8,22628.0,357.0
4,29988,2008-08-27 12:40:40,.net,2008,8,30001.0,3191.0
...,...,...,...,...,...,...,...
9995,30211,2008-08-27 14:06:52,zip,2008,8,30473.0,1414.0
9996,17250,2008-08-20 00:16:40,zip,2008,8,,394.0
9997,12823,2008-08-15 22:09:05,zipcode,2008,8,12875.0,350.0
9998,25,2008-08-01 12:13:50,zos,2008,8,,23.0


In [None]:
df.loc[df['tag'] == 'python']

Unnamed: 0,id,creation_time,tag,year,month,accepted_id,owner_user_id
6740,32899,2008-08-28 17:49:02,python,2008,8,,720.0
6741,5909,2008-08-08 13:35:19,python,2008,8,5985.0,394.0
6742,469,2008-08-02 15:11:16,python,2008,8,3040.0,147.0
6743,29562,2008-08-27 05:03:07,python,2008,8,29575.0,2908.0
6744,742,2008-08-03 15:55:28,python,2008,8,,189.0
...,...,...,...,...,...,...,...
6831,19339,2008-08-21 04:29:07,python,2008,8,19343.0,680.0
6832,1171,2008-08-04 12:00:57,python,2008,8,28705.0,280.0
6833,19151,2008-08-21 00:36:11,python,2008,8,24377.0,145.0
6834,2311,2008-08-05 13:40:47,python,2008,8,2316.0,394.0
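
A few variations on conditional selection, sketched on a toy dataframe with made-up rows: combining conditions, matching against a list of values with `isin`, and writing the filter as a string with `query`.

```python
import pandas as pd

# Made-up rows with the same column names as the example data.
toy = pd.DataFrame(
    {
        "id": [9, 469, 742, 25],
        "tag": [".net", "python", "python", "zos"],
        "year": [2008, 2008, 2008, 2008],
        "month": [7, 8, 8, 8],
    }
)

# Combine conditions with & (and) or | (or); each condition needs parentheses.
aug_python = toy.loc[(toy["month"] == 8) & (toy["tag"] == "python")]

# isin() keeps rows whose value matches any entry of the list.
py_or_zos = toy.loc[toy["tag"].isin(["python", "zos"])]

# query() expresses the same kind of filter as a string.
july = toy.query("year == 2008 and month == 7")

print(aug_python)
print(py_or_zos)
print(july)
```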


## Data cleaning
Sometimes we need to modify the variables for future use. In the example data set, some questions are tagged as 'net' and others as '.net'. Since these are essentially the same tag, we can change '.net' to 'net'.

In [None]:
# Convert .net to net
# Assign through a single .loc call. The chained form df['tag'][mask] = 'net'
# triggers pandas' SettingWithCopyWarning and may fail to modify df.
is_dotnet = df['tag'] == '.net'
df.loc[is_dotnet, 'tag'] = 'net'
df


Unnamed: 0,id,creation_time,tag,year,month,accepted_id,owner_user_id
0,9,2008-07-31 23:40:59,net,2008,7,1404.0,1.0
1,39,2008-08-01 12:43:11,net,2008,8,45.0,33.0
2,14,2008-08-01 00:59:11,net,2008,8,,11.0
3,104,2008-08-01 15:12:34,net,2008,8,112.0,39.0
4,59,2008-08-01 13:14:33,.net-3.5,2008,8,,45.0
...,...,...,...,...,...,...,...
95,88,2008-08-01 14:36:18,visual-c++,2008,8,98.0,61.0
96,16,2008-08-01 04:59:33,web-services,2008,8,12446.0,2.0
97,88,2008-08-01 14:36:18,winapi,2008,8,98.0,61.0
98,108,2008-08-01 15:22:29,windows,2008,8,111.0,72.0
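
Another way to recode values is `Series.replace` with a mapping from old to new values. This is a sketch on made-up rows, not necessarily the approach used in the original project.

```python
import pandas as pd

# Made-up rows containing both spellings of the tag.
toy = pd.DataFrame({"tag": [".net", "net", ".net-3.5", "c#"]})

# replace() with a dict matches whole values, so '.net-3.5' is left alone.
toy["tag"] = toy["tag"].replace({".net": "net"})
print(toy["tag"].tolist())
```

A mapping like this extends naturally when several spellings need to be merged at once.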
