
# Pandas: introduction

## Importing pandas

pandas is part of the Anaconda Python distribution and can be imported directly.

In [2]:
import pandas



## Reading in data

###  The read_csv function

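The most common scenario is to read in a data file from disk. Pandas has powerful functions to read in data. For example, the ```read_csv()``` function has many (and I mean many) options. The signature below comes from an older pandas release, so the exact list differs between versions: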

```read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal='.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=False, error_bad_lines=True, warn_bad_lines=True, skipfooter=0, skip_footer=0, doublequote=True, delim_whitespace=False, as_recarray=False, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, memory_map=False, float_precision=None)```

Documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

Luckily, we won't need most of these options most of the time.

Here, I will use example data on passenger survival on the Titanic. The data is described here: https://vincentarelbundock.github.io/Rdatasets/doc/datasets/Titanic.html

You can also use this link: https://tinyurl.com/y894ft6g

In [3]:
# Options used: read column 0 as the index; treat 'NA' as a missing value.
data = pandas.read_csv('https://tinyurl.com/y894ft6g', index_col=0, na_values='NA')
data.head()

Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
1,"Allen, Miss Elisabeth Walton",1st,29.0,female,1,1
2,"Allison, Miss Helen Loraine",1st,2.0,female,0,1
3,"Allison, Mr Hudson Joshua Creighton",1st,30.0,male,0,0
4,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,female,0,1
5,"Allison, Master Hudson Trevor",1st,0.92,male,1,0
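To try the `index_col` and `na_values` options without downloading anything, the sketch below reads a small, made-up table from a string. The table contents, the `'?'` marker, and the names `csv_text` and `small` are assumptions for illustration only; `io.StringIO` simply makes the string behave like a file. The marker `'?'` is used instead of `'NA'` because pandas already treats `'NA'` as missing by default, so a custom marker shows more clearly what `na_values` does.

```python
import io
import pandas

# Made-up data in a layout similar to the Titanic file (not the real data).
# The first column has no name and holds the row numbers.
csv_text = """,Name,Age
1,Passenger A,29
2,Passenger B,?
3,Passenger C,0.92
"""

# index_col=0: use the first column as the row index.
# na_values='?': treat the string '?' as a missing value.
small = pandas.read_csv(io.StringIO(csv_text), index_col=0, na_values='?')

print(small)
print(small['Age'].isna().sum())  # number of missing ages
```

Without `na_values='?'`, the `Age` column would be read as text, because `'?'` cannot be converted to a number.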


### Some important options for the ```read_csv()``` function

Some options will come in handy. These include the ```sep``` (separator) option and the ```header``` option. Often, data files use different separators or do not come with a header (i.e., a row of column names). If so, these options allow reading in the file correctly. The default values for these options are:

+ separator: ```sep=','```
+ header:  ```header='infer'```



In [4]:
import pandas
silly_data = pandas.read_csv('data/silly.txt')
silly_data

Unnamed: 0,123*234*123*1234
0,4543*2342*123*224
1,456*1223*1234*123


In [5]:
silly_data = pandas.read_csv('data/silly.txt', sep='*')
silly_data

Unnamed: 0,123,234,123.1,1234
0,4543,2342,123,224
1,456,1223,1234,123


In [18]:
silly_data = pandas.read_csv('data/silly.txt', sep='*', header=None)
silly_data

Unnamed: 0,0,1,2,3
0,123,234,123,1234
1,4543,2342,123,224
2,456,1223,1234,123


If the data does not contain a header, you can set the column names to more convenient values as follows:

In [21]:
silly_data.columns = ['some_name', 'column2', 'age', 'test']
silly_data

Unnamed: 0,some_name,column2,age,test
0,123,234,123,1234
1,4543,2342,123,224
2,456,1223,1234,123
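As an alternative sketch, the column names can also be passed while reading the file, using the `names` option listed in the `read_csv()` signature. To keep the example self-contained, the text below is held in a string that is assumed to match the content of `silly.txt` (reconstructed from the outputs shown earlier); the variable names `silly_text` and `silly_named` are made up for this example.

```python
import io
import pandas

# Assumed content of silly.txt, reconstructed from the outputs shown earlier.
silly_text = "123*234*123*1234\n4543*2342*123*224\n456*1223*1234*123\n"

# header=None: the file has no row of column names.
# names=[...]: use these column names instead.
silly_named = pandas.read_csv(
    io.StringIO(silly_text),
    sep='*',
    header=None,
    names=['some_name', 'column2', 'age', 'test'],
)
print(silly_named)
```

Both approaches should give the same dataframe; setting the names while reading just saves a separate step.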


## Important notice: reading in files from your computer

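Given only a file name, the ```read_csv()``` function looks for the file in the current working directory, which is usually the folder that contains your script. So the function can only find the data file by name if the script and the data file are in the same folder.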

![title](Selection_318.png)

If you do not place the Python script and the data file in the same folder, you need to provide the (relative) path to the file.
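The following sketch shows one way to build such a path with the standard `pathlib` module. It creates a temporary folder with a `data` subfolder and a tiny file, so the folder layout, the file name `example.txt`, and its contents are all hypothetical.

```python
import pathlib
import tempfile
import pandas

# The folder that relative paths start from.
print(pathlib.Path.cwd())

# Create a throwaway folder with a 'data' subfolder to mimic a project layout.
with tempfile.TemporaryDirectory() as folder:
    data_folder = pathlib.Path(folder) / 'data'
    data_folder.mkdir()

    # Write a tiny file so there is something to read (hypothetical content).
    file_path = data_folder / 'example.txt'
    file_path.write_text('a,b\n1,2\n3,4\n')

    # Passing the full path works regardless of the current working directory.
    example = pandas.read_csv(file_path)

print(example)
```

In your own project, a relative path such as `'data/silly.txt'` (used earlier) is read relative to the folder printed by `pathlib.Path.cwd()`.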

## Exploring the data

### Exploring the data in the Spyder variable explorer

[...]


### Exploring the data using code

Print the top rows of the dataset:

In [7]:
print(data.head(3))

                                  Name PClass   Age     Sex  Survived  SexCode
1         Allen, Miss Elisabeth Walton    1st  29.0  female         1        1
2          Allison, Miss Helen Loraine    1st   2.0  female         0        1
3  Allison, Mr Hudson Joshua Creighton    1st  30.0    male         0        0


How big is our dataset?

In [8]:
print(data.shape)

(1313, 6)


Remember, the dataframe allows for multiple data types.

In [9]:
print(data.dtypes)

Name         object
PClass       object
Age         float64
Sex          object
Survived      int64
SexCode       int64
dtype: object
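A minimal sketch of how data types are assigned per column, using a small made-up dataframe (the name `mixed` and its contents are assumptions, not part of the Titanic data). It also suggests why `Age` is stored as `float64`: a numeric column that contains decimals or missing values is read as floating point.

```python
import pandas

mixed = pandas.DataFrame({
    'name': ['a', 'b', 'c'],      # text
    'count': [1, 2, 3],           # whole numbers, no missing values
    'age': [29.0, None, 0.92],    # decimals and one missing value
})

print(mixed.dtypes)        # one data type per column
print(mixed.isna().sum())  # number of missing values per column
```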


## Columns and indices

In [10]:
a = data.columns
print(a)

Index(['Name', 'PClass', 'Age', 'Sex', 'Survived', 'SexCode'], dtype='object')


In [11]:
b = data.index
print(b)

Int64Index([   1,    2,    3,    4,    5,    6,    7,    8,    9,   10,
            ...
            1304, 1305, 1306, 1307, 1308, 1309, 1310, 1311, 1312, 1313],
           dtype='int64', length=1313)
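The column names and the index are the labels of the columns and rows, respectively. As a small sketch on a made-up dataframe (the name `frame` and its values are assumptions), both can be converted to plain Python lists, and a single value can be looked up by its row and column label with `.loc`:

```python
import pandas

frame = pandas.DataFrame(
    {'x': [10, 20, 30], 'y': [1.5, 2.5, 3.5]},
    index=[1, 2, 3],  # row labels starting at 1, as in the Titanic data
)

column_names = list(frame.columns)
row_labels = list(frame.index)
print(column_names)
print(row_labels)

# Look up one value by row label and column name.
value = frame.loc[2, 'y']
print(value)
```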


## Accessing the columns

In [12]:
sex = data.SexCode
#print(sex)

sex = data['SexCode']
print(sex)

sex.mean()

1       1
2       1
3       0
4       1
5       0
       ..
1309    0
1310    0
1311    0
1312    0
1313    0
Name: SexCode, Length: 1313, dtype: int64


0.3518659558263519

In [13]:
age = data.Age
print(age.head())

1    29.00
2     2.00
3    30.00
4    25.00
5     0.92
Name: Age, dtype: float64
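The two access styles used above, `data.SexCode` and `data['SexCode']`, return the same column. The sketch below illustrates this on a made-up dataframe (the name `people` and the column `'Sex Code'` are hypothetical). It also shows a case where only the bracket style works, namely a column name that contains a space.

```python
import pandas

people = pandas.DataFrame({
    'Age': [29.0, 2.0, 30.0],
    'Sex Code': [1, 1, 0],  # hypothetical column name containing a space
})

age_attr = people.Age     # attribute access
age_item = people['Age']  # bracket access, same column
print(age_attr.equals(age_item))

# Bracket access also handles names that are not valid Python identifiers.
sex_mean = people['Sex Code'].mean()
print(sex_mean)

# A list of names selects several columns and returns a dataframe.
subset = people[['Age', 'Sex Code']]
print(subset.shape)
```

Bracket access is the safer default, since attribute access can also clash with existing dataframe methods and attributes.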


## Creating new variables (columns)

You can make new variables (columns) based on existing variables and add them to the dataframe.

In [14]:
data['Months'] = data['Age'] * 12
data.head()

Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode,Months
1,"Allen, Miss Elisabeth Walton",1st,29.0,female,1,1,348.0
2,"Allison, Miss Helen Loraine",1st,2.0,female,0,1,24.0
3,"Allison, Mr Hudson Joshua Creighton",1st,30.0,male,0,0,360.0
4,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,female,0,1,300.0
5,"Allison, Master Hudson Trevor",1st,0.92,male,1,0,11.04


In [15]:
data['Kids'] = data['Months'] < 15
data.head()

Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode,Months,Kids
1,"Allen, Miss Elisabeth Walton",1st,29.0,female,1,1,348.0,False
2,"Allison, Miss Helen Loraine",1st,2.0,female,0,1,24.0,False
3,"Allison, Mr Hudson Joshua Creighton",1st,30.0,male,0,0,360.0,False
4,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,female,0,1,300.0,False
5,"Allison, Master Hudson Trevor",1st,0.92,male,1,0,11.04,True


In [16]:
data['Silly'] = data['Months'] / data['Age']
data.head()

Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode,Months,Kids,Silly
1,"Allen, Miss Elisabeth Walton",1st,29.0,female,1,1,348.0,False,12.0
2,"Allison, Miss Helen Loraine",1st,2.0,female,0,1,24.0,False,12.0
3,"Allison, Mr Hudson Joshua Creighton",1st,30.0,male,0,0,360.0,False,12.0
4,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,female,0,1,300.0,False,12.0
5,"Allison, Master Hudson Trevor",1st,0.92,male,1,0,11.04,True,12.0


In [17]:
data['Child'] = data.Age < 18
data.head()

Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode,Months,Kids,Silly,Child
1,"Allen, Miss Elisabeth Walton",1st,29.0,female,1,1,348.0,False,12.0,False
2,"Allison, Miss Helen Loraine",1st,2.0,female,0,1,24.0,False,12.0,True
3,"Allison, Mr Hudson Joshua Creighton",1st,30.0,male,0,0,360.0,False,12.0,False
4,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,female,0,1,300.0,False,12.0,False
5,"Allison, Master Hudson Trevor",1st,0.92,male,1,0,11.04,True,12.0,True
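One point to keep in mind with comparisons such as `data.Age < 18`: if the column contains missing values, the comparison evaluates to `False` for those rows, so an unknown age is silently counted as "not a child". The sketch below demonstrates this on a made-up dataframe (the names `ages` and `ChildOrMissing` are assumptions) and shows one possible way to keep the missing values visible.

```python
import pandas

ages = pandas.DataFrame({'Age': [29.0, 2.0, None]})

# A comparison with a missing value evaluates to False.
ages['Child'] = ages['Age'] < 18

# Keep the result only where Age is known; other rows become missing.
ages['ChildOrMissing'] = ages['Child'].where(ages['Age'].notna())

print(ages)
```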


## Exercises

### Exercise 1

+ Read in the cars.txt file (download it or read it in from http://tinyurl.com/ybbw5gwg)
+ Print the first lines of the data
+ Print the column names
+ Calculate the difference between the highway mpg and the city mpg, and add this difference as a new variable to the dataframe.


### Exercise 2


Download the `wages1833.csv` data file (from the data folder) and save it to your computer. You can also use this direct URL: http://tinyurl.com/y6orr2bg

This file contains data on the wages of Lancashire cotton factory workers in 1833.  For each age category, the file lists the following:

* `age`: age in years
* `mnum`: number of male workers of the corresponding age
* `mwage`: average wage of male workers of the corresponding age
* `fnum`: number of female workers of the corresponding age
* `fwage`: average wage of female workers of the corresponding age

More info on the data can be found in this paper:  Boot, H.M. 1995. How Skilled Were the Lancashire Cotton Factory Workers in 1833? Economic History Review 48: 283-303.

Write a script that does the following:

* Reads in the data as a pandas dataframe
* Adds a new variable that lists the difference between the number of male and female workers
* Adds a new variable ```diff_pct``` that gives the difference in average wage between the male and female workers expressed as a percentage of the female wage. For example, if the average female wage is 90 and the male wage is 135, this new column lists the number 50. In the form of an equation, this gives:

$$
diff_{pct} = 100 \times \frac{(mwage - fwage)}{fwage}
$$



### Exercise 3


Suppose I want to read in a file named `data.txt`. The columns of the data are separated by tabs. How can I make sure that pandas reads this data correctly? Complete the blank in the following piece of code:

```python
data = pandas.read_csv('data.txt', _______)
```


### Exercise 4

Suppose my data file named `data.csv` has no header (no variable names). How can I read in this file correctly?