# POLI 175 - Machine Learning for Social Sciences

## Python Refresh I

---

# Data Science with Pandas

## Load Pandas

Loading pandas is very easy. Provided that the package is installed (if not, see the installation instructions [here](https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html)), type:

In [1]:
import pandas as pd

## Load Data into Python

To start having fun, we need to load data into Python. We can do this in three ways: from a local file, from the internet, or by typing the data on the keyboard.

### From a Local File

First, we need to find the working directory. To do that, we use the standard-library module `os`:

```
import os
print(os.getcwd())
```

In [2]:
import os
print(os.getcwd())

/Users/baraazekeria/UCSD/POLI175/poli175/wk01



Then, put the data file in that folder. If you would rather change the working directory, use:

```
os.chdir("new_path_here")
```

Now that we know the folder, and the file is there, we can load it:

```
dat = pd.read_csv('file_name_here.csv')
```
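
As a minimal sketch of the steps above, the following writes a small CSV to a temporary folder and reads it back. The toy data, the file name `toy.csv`, and the use of `tempfile` are assumptions made only so the example is self-contained; with your own data you would pass the path of your file to `pd.read_csv`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data, used only so there is a file to read back.
toy = pd.DataFrame({"country": ["A", "B", "C"], "score": [1.5, 2.0, 3.5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Build the full path instead of changing the working directory.
    csv_path = os.path.join(tmp_dir, "toy.csv")

    # index=False keeps the row labels out of the file; otherwise they
    # come back as an extra "Unnamed: 0" column when the file is read.
    toy.to_csv(csv_path, index=False)

    loaded = pd.read_csv(csv_path)

print(loaded)
```

Building the path with `os.path.join` is one alternative to `os.chdir`: the working directory stays untouched, which tends to be easier to reason about in longer scripts.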

### From the Internet

We can also load a dataset directly from a URL.

For example, consider the following dataset: https://raw.githubusercontent.com/umbertomig/qtm150/master/datasets/PErisk.csv.

To open it, we pass the URL to `read_csv`, just as we did with the local file.

In [3]:
dat = pd.read_csv("https://raw.githubusercontent.com/umbertomig/qtm150/master/datasets/PErisk.csv")
dat.head()

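| Unnamed: 0 | country    | courts | barb2     | prsexp2 | prscorr2 | gdpw2    |
|------------|------------|--------|-----------|---------|----------|----------|
| 0          | Argentina  | 0      | -0.720775 | 1       | 3        | 9.69017  |
| 1          | Australia  | 1      | -6.907755 | 5       | 4        | 10.30484 |
| 2          | Austria    | 1      | -4.910337 | 5       | 4        | 10.10094 |
| 3          | Bangladesh | 0      | 0.775975  | 1       | 0        | 8.379768 |
| 4          | Belgium    | 1      | -4.617344 | 5       | 4        | 10.25012 |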


### From typing in the keyboard

We can also build a dataset from scratch.

For example, we could build a simple dataset in the following way:

```
dat = pd.DataFrame({
    "v1": ['d1', 'd2', 'd3'],
    "v2": [1, 2, 3],
    "v3": ['A', 'B', 'A'],
    "v4": [2.0, 1.1, 2.2]})
```

This works well for small datasets, but typing every value by hand quickly becomes impractical.
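
After typing data by hand, it can help to check what pandas inferred for each column. The sketch below reuses the same four toy columns and inspects their types; the helper functions from `pd.api.types` are used here only to make the check explicit.

```python
import pandas as pd

# Same toy dataset as above: each dictionary key becomes a column.
dat = pd.DataFrame({
    "v1": ['d1', 'd2', 'd3'],
    "v2": [1, 2, 3],
    "v3": ['A', 'B', 'A'],
    "v4": [2.0, 1.1, 2.2]})

# One inferred type per column: text, integer, text, float.
print(dat.dtypes)

# Explicit checks on the numeric columns.
v2_is_integer = pd.api.types.is_integer_dtype(dat["v2"])
v4_is_float = pd.api.types.is_float_dtype(dat["v4"])
print(v2_is_integer, v4_is_float)
```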

## Dataset Information

Suppose we have a pandas dataset called `dat`. To make it more realistic, load the following datasets:

```
# For me: PErisk
dat = pd.read_csv('https://raw.githubusercontent.com/umbertomig/qtm150/master/datasets/PErisk.csv')

# For you: tips
dat2 = pd.read_csv('https://raw.githubusercontent.com/umbertomig/qtm151/main/datasets/tips.csv')
```

If you are having VPN issues, let me know.

In [None]:
# My code here

### .info(.)

This method prints a summary of the dataset: the column names, the number of non-missing values in each column, and the column types.

Syntax and Usage: `dat.info()`
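
One detail worth knowing is that `.info()` prints its summary by itself and returns `None`, so wrapping it in `print` adds a stray `None` to the output. A small sketch with a hypothetical toy dataset:

```python
import pandas as pd

# Hypothetical toy data with one missing value in v1.
toy = pd.DataFrame({"v1": ["d1", "d2", None], "v2": [1, 2, 3]})

# .info() writes the summary to the screen on its own...
result = toy.info()

# ...and hands back None, which is all that print() would add.
print(result)
```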

In [None]:
# My code here

### .head(.)

This method prints the first few observations of the dataset.

Syntax and Usage: `print(dat.head())`

In [None]:
# My code here

### .shape

This attribute holds the number of rows and columns of a dataset, as a `(rows, columns)` tuple.

Syntax and Usage: `print(dat.shape)`

Note: no parentheses are needed, because `.shape` is an attribute, not a method.
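
Because `.shape` is a plain tuple, it can be unpacked into two variables. A minimal sketch with a hypothetical toy dataset:

```python
import pandas as pd

# Hypothetical toy data: 3 rows, 2 columns.
toy = pd.DataFrame({"v1": ["d1", "d2", "d3"], "v2": [1, 2, 3]})

# Unpack the (rows, columns) tuple.
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")
```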

In [None]:
# My code here

### .describe(.)

This method gives us a few summary statistics of the dataset.

Syntax and Usage: `print(dat.describe())`
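
By default, `.describe()` on a dataset that mixes numeric and text columns summarizes only the numeric ones. The sketch below, on a hypothetical toy dataset, shows that default next to `include="all"`, which also summarizes the text columns.

```python
import pandas as pd

# Hypothetical toy data: one text column, one numeric column.
toy = pd.DataFrame({
    "group": ["A", "B", "A", "B"],
    "score": [1.0, 2.0, 3.0, 4.0]})

# Default: count, mean, std, min, quartiles, and max for numeric columns.
numeric_summary = toy.describe()
print(numeric_summary)

# include="all" adds unique, top, and freq for the text columns.
full_summary = toy.describe(include="all")
print(full_summary)
```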

In [None]:
# My code here

### .values

This attribute holds the observations of the dataset as a NumPy array, without the row and column labels.

Syntax and Usage: `print(dat.values)`

Note: no parentheses are needed.

In [None]:
# My code here

### .columns

This attribute holds the variable (column) names of the dataset.

Syntax and Usage: `print(dat.columns)`

Note: no parentheses are needed.

In [None]:
# My code here

### .index

This attribute holds the row labels (the index) of the dataset.

Syntax and Usage: `print(dat.index)`

Note: no parentheses are needed.
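
The three attributes `.values`, `.columns`, and `.index` can be read as the three pieces of a dataset: the data itself, the column labels, and the row labels. A short sketch on a hypothetical toy dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"v1": ["d1", "d2", "d3"], "v2": [1, 2, 3]})

# The data, as a 2-D NumPy array with no labels attached.
data = toy.values
print(type(data), data.shape)

# The column labels and the row labels, converted to plain lists.
col_names = list(toy.columns)
row_labels = list(toy.index)
print(col_names)
print(row_labels)
```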

In [None]:
# My code here

**Exercise**: Run the same examples for the dataset `dat2`

In [None]:
## Your answers here!

## Data Manipulation

### Subsetting variables (columns)

To subset variables, the syntax is simple. When it is only one variable:

```
dat["var_name"]
```

When it is two or more, you need to enclose them in a list:

```
dat[["var1", "var2"]]
```
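
The two forms return different kinds of objects, which matters for later steps. A minimal sketch on a hypothetical toy dataset: single brackets with one name give a Series, while a list of names always gives a DataFrame, even when the list has a single name.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "v1": ["d1", "d2", "d3"],
    "v2": [1, 2, 3],
    "v3": ["A", "B", "A"]})

one_col = toy["v2"]            # a Series (one column, no table around it)
one_col_frame = toy[["v2"]]    # a DataFrame with a single column
two_cols = toy[["v1", "v3"]]   # a DataFrame with two columns

print(type(one_col).__name__)
print(type(one_col_frame).__name__)
print(two_cols.shape)
```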

In [None]:
# My code here

### Subsetting cases (rows)

Now, to work with cases, notice that pandas allows us to do vectorized operations. For instance:

```
dat["var_name"] > some_number
```

This returns a Series of Booleans: True for each row where the variable is greater than the number, and False otherwise. To subset the dataset, place that expression inside the brackets:

```
dat[dat["var_name"] > some_number]
```
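
The two steps can be seen separately by storing the comparison in its own variable. This is a sketch on a hypothetical toy dataset; the column names and the cutoff of 9.0 are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "gdp": [9.7, 10.3, 8.4]})

# Step 1: the comparison runs on every row at once.
mask = toy["gdp"] > 9.0
print(mask)

# Step 2: keep only the rows where the mask is True.
subset = toy[mask]
print(subset)
```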

To combine several comparisons, wrap each one in parentheses and join them with `&` (and) or `|` (or):

```
dat[ (dat["v1"] == "some_value") & (dat["v2"] == "some_other_value") ]
```

And if we want a command similar to `%in%` in R, we can use the `.isin(.)` method:

```
dat[ dat["v1"].isin(["some_value", "some_other_value"]) ]
```
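
A short sketch of these combinations on a hypothetical toy dataset. It also uses `~`, which negates a condition; the parentheses around each comparison are needed because `&` and `|` bind more tightly than `==` and `>`.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "v1": ["a", "b", "a", "c"],
    "v2": [1, 2, 3, 4]})

# Both conditions must hold (and).
both = toy[(toy["v1"] == "a") & (toy["v2"] > 1)]

# At least one condition must hold (or).
either = toy[(toy["v1"] == "c") | (toy["v2"] == 1)]

# Rows whose v1 is NOT in the list (negated .isin).
not_in = toy[~toy["v1"].isin(["a", "b"])]

print(both)
print(either)
print(not_in)
```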

In [None]:
# My code here

**Exercise**: Filter the `tips` dataset (our `dat2`) by:

1. Bills of more than 10 dollars
2. Smokers
3. Weekend

Do each of these separately, then do all together.

In [None]:
## Your answers here!

### Simple computations

It is simple to create new variables from older ones.

```
# Summing two variables
dat["my_new_var"] = dat["my_old_var1"] + dat["my_old_var2"]

# Multiplying by a constant
dat["my_new_var"] = dat["my_old_var1"] * constant

# Apply some numpy function (try to always use numpy functions, as pandas is based on numpy)
import numpy as np
dat["my_new_logged_var"] = np.log(dat["my_old_var"])
```
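
The template above uses placeholder names. A runnable sketch of the same three operations on a hypothetical toy dataset, where the columns `x` and `y` and the constant 10 are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"x": [1.0, 2.0, 4.0], "y": [3.0, 2.0, 1.0]})

# Summing two variables, row by row.
toy["x_plus_y"] = toy["x"] + toy["y"]

# Multiplying by a constant.
toy["x_times_10"] = toy["x"] * 10

# Applying a numpy function to the whole column (natural logarithm).
toy["log_x"] = np.log(toy["x"])

print(toy)
```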

In [None]:
# My code here

**Exercise**: In the `tips` dataset, create the variable `prop_tip`, which is the tip as a proportion of the total bill.

In [None]:
## Your answers here!

## Statistics

We can easily compute statistics from the data. Here are a few methods that we have available:

| Method           | Description                  |
|------------------|------------------------------|
| `.median()`      | Median                       |
| `.mean()`        | Mean                         |
| `.min()`         | Minimum                      |
| `.max()`         | Maximum                      |
| `.var()`         | Variance                     |
| `.std()`         | Standard Deviation           |
| `.sum()`         | Sum values                   |
| `.mode()`        | Most frequent value(s)       |
| `.quantile(val)` | Quantile (`val` from 0 to 1) |
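
A brief sketch of some of these methods on a hypothetical toy dataset. Two details are worth noting: `.var()` and `.std()` compute the sample versions (dividing by n - 1) by default, and `.mode()` returns a Series, because more than one value can tie for most frequent.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "score": [1.0, 2.0, 2.0, 3.0, 7.0],
    "group": ["A", "B", "B", "A", "B"]})

mean_score = toy["score"].mean()
median_score = toy["score"].median()
var_score = toy["score"].var()          # sample variance (n - 1)
q75_score = toy["score"].quantile(0.75)  # third quartile
mode_group = toy["group"].mode()         # a Series of the most frequent value(s)

print(mean_score, median_score, var_score, q75_score)
print(mode_group)
```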

In [None]:
# My code here

**Exercise**: For the `tips` dataset:

1. Compute the mean and median of tip
2. Compute the mode of day
3. Compute the first quartile of `totbill`

In [None]:
## Your answers here!

## Questions?

## Great job! See you next class!