---

# A Brief Introduction to Python for Data Analytics


### Instructor
[George Milunovich](https://www.georgemilunovich.com)    
[george.milunovich@mq.edu.au](mailto:george.milunovich@mq.edu.au)   
[Associate Professor](https://researchers.mq.edu.au/en/persons/george-milunovich)  
Department of Actuarial Studies and Business Analytics  
Macquarie University   
Sydney, Australia  


---

## Part 4: Intro to Data Analysis

- [Libraries](#Libraries)
- [pandas](#pandas)
- [Working with pandas DataFrame](#Working-with-pandas-DataFrame)
- [Making Predictions with sklearn](#Making-Predictions-with-sklearn)

---
## Libraries

Python libraries are collections of functions and methods that allow us to perform many actions without writing the code ourselves

- This term is often used interchangeably with “Python package” because packages can also contain modules and other packages (subpackages)
- However, it is often assumed that while a package is a collection of modules, a library is a collection of packages
- E.g. `Matplotlib`, `PyTorch`
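
As a small illustration of how modules and packages are accessed, the sketch below shows three common import forms. It uses only the standard library so that it runs without installing anything; the variable names are chosen here purely for illustration.

```python
# Three common ways of importing, shown with the standard library
import math                   # import a whole module
from statistics import mean   # import a single function from a module
import urllib.parse as up     # import a module that lives inside a package, under an alias

root = math.sqrt(16)          # call a function through the module name
avg = mean([1, 2, 3, 4])      # call the imported function directly
encoded = up.quote('default of credit card clients')  # spaces become %20

print(root, avg, encoded)
```

Third-party libraries such as `pandas` are imported in exactly the same way once they are installed.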




---

## pandas

`pandas` (the name derives from "panel data") is a Python library written for data manipulation and analysis 
- It offers data structures and operations for manipulating numerical tables and time series


<br>

![image.png](images/tute_pic1.png)




- To load our dataset into Python, we need to access (import, in Python terminology) a library that allows us to read CSV or Excel files. 
- We can import `pandas` and call it by its abbreviation `pd` using:



```
import pandas as pd
```


<hr style="width:30%;margin-left:0;"> 

### Dataset

- Our dataset contains data on credit card defaults available from [https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients) 
- I have also saved the dataset to the `data` directory
- Each row represents a different client 
- Each column stores a client attribute (e.g. age, marital status, etc.)
- The last column is the target variable: whether or not the client defaulted on the payment. 



```

# df = pd.read_excel('https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls')

df = pd.read_csv('data/credit_cards_data.csv')

df
```

- We can skip the top row of the file by passing `skiprows=[0]`

```
df2 = pd.read_csv('data/credit_cards_data.csv', skiprows=[0])

df2
```
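
To see what `skiprows=[0]` does, the sketch below reads a tiny in-memory CSV instead of the real file. The layout (an extra label row sitting above the real column names) and all values are made up for illustration and are only an assumption about how the course file is organised.

```python
import io
import pandas as pd

# Hypothetical miniature file: line 0 holds generic labels, line 1 the real column names
csv_text = """X1,X2,Y
ID,LIMIT_BAL,default
1,20000,1
2,120000,0
"""

# Without skiprows, the generic labels become the header and the real names become a data row
raw = pd.read_csv(io.StringIO(csv_text))

# With skiprows=[0], line 0 is ignored and the real names become the header
clean = pd.read_csv(io.StringIO(csv_text), skiprows=[0])

print(raw.columns.tolist(), raw.shape)
print(clean.columns.tolist(), clean.shape)
```

Because the stray row is gone, the numeric columns in `clean` are parsed as numbers rather than as text.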

<br>

Now that the data is in memory, we can manipulate it. First, let’s ask Python what type of thing `df2` refers to:

```

print(type(df2))
```

---

### Working with `pandas` `DataFrame`

A few commands come in useful when working with pandas DataFrame objects. Try the following:

- `del df2['ID']` # delete column ID
- `df2.info()`    # DataFrame information
- `df2.columns`   # List columns
- `df2.shape`     # Show dimensions of the DataFrame - number of rows followed by the number of columns
- `df2.head()`    # Print the first 5 examples (observations)
- `df2.tail()`    # Print the last 5 examples (observations)
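
The commands above can be tried on a small stand-in DataFrame. The frame `toy` and its values are invented for this sketch; with the real data, replace `toy` with `df2`.

```python
import pandas as pd

# Hypothetical stand-in for df2 with made-up values
toy = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6],
    'LIMIT_BAL': [20000, 120000, 90000, 50000, 50000, 500000],
    'AGE': [24, 26, 34, 37, 57, 37],
})

del toy['ID']                 # delete column ID in place
toy.info()                    # column names, non-null counts and dtypes

n_rows, n_cols = toy.shape    # shape is (rows, columns)
print(n_rows, n_cols)
print(list(toy.columns))
print(toy.head())             # first 5 rows by default
print(toy.tail(2))            # last 2 rows
```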


<hr style="width:30%;margin-left:0;"> 

### Accessing Elements in `pandas`

There are multiple ways of accessing elements and slicing DataFrames in `pandas`
- Selecting multiple rows by position, note: does not include 3, e.g. `df2[0:3]`
- Selecting multiple rows by label with `.loc[]`, note: includes 3, e.g. `df2.loc[0:3]`
- Selecting multiple rows by position with `.iloc[]`, note: does not include 3, e.g. `df2.iloc[0:3]`
- Selecting multiple columns, e.g.: `df2[['LIMIT_BAL', 'SEX', 'EDUCATION']]`
- Selecting multiple rows and columns using `.loc[]`, e.g.: `df2.loc[0:3, ['LIMIT_BAL', 'SEX', 'EDUCATION']]`
- Selecting multiple rows and columns using `.iloc[]`, e.g.: `df2.iloc[0:3, 0:3]`
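
The difference between label-based and position-based slicing is easiest to see by counting rows. The sketch below uses a made-up five-row frame named `toy` with the default integer index, so labels and positions happen to coincide.

```python
import pandas as pd

# Hypothetical frame with made-up values and the default index 0..4
toy = pd.DataFrame({
    'LIMIT_BAL': [20000, 120000, 90000, 50000, 50000],
    'SEX': [2, 2, 2, 2, 1],
    'EDUCATION': [2, 2, 2, 2, 2],
    'AGE': [24, 26, 34, 37, 57],
})

rows_plain = toy[0:3]        # positions 0, 1, 2 (end excluded)
rows_loc = toy.loc[0:3]      # labels 0, 1, 2, 3 (end included)
rows_iloc = toy.iloc[0:3]    # positions 0, 1, 2 (end excluded)

block_loc = toy.loc[0:3, ['LIMIT_BAL', 'SEX', 'EDUCATION']]   # 4 rows, 3 named columns
block_iloc = toy.iloc[0:3, 0:3]                               # 3 rows, first 3 columns

print(len(rows_plain), len(rows_loc), len(rows_iloc))
print(block_loc.shape, block_iloc.shape)
```

Only `.loc[]` includes the end of the slice, because it works with labels rather than positions.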


---
## Predicting Whether a Customer Will Default

- Let's try to predict whether a customer will default based on their characteristics
- First we will create `y` and `X`

```
y = df2['default payment next month']
print(y)


X = df2.copy()
print(X.columns)

del X['default payment next month']  # to remove y from X
print(X.columns)

```
Note: I used `copy()` here rather than a simple `=` assignment when creating `X`.   
This is because a pandas `DataFrame` is **mutable** and `=` only creates a second name for the same object: if I deleted 'default payment next month' from `X`, it would also be deleted from `df2`. Using `copy()` avoids this.
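
The effect of `=` versus `copy()` can be checked on a tiny made-up frame. All names in this sketch (`original`, `alias`, `independent`) are hypothetical.

```python
import pandas as pd

# Case 1: plain assignment creates a second name for the same object
original = pd.DataFrame({'a': [1, 2], 'target': [0, 1]})
alias = original
del alias['target']                    # also removes the column from `original`
print('target' in original.columns)    # the column is gone from both names

# Case 2: copy() creates an independent DataFrame
original2 = pd.DataFrame({'a': [1, 2], 'target': [0, 1]})
independent = original2.copy()
del independent['target']              # `original2` is left untouched
print('target' in original2.columns)
```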

---
## Visualising Data

While Python has no official plotting library, `matplotlib` is the de facto standard
- However, when using `pandas` we can use pandas plotting methods, which are shortcuts to `matplotlib`'s functions
- First, we will import the `pyplot` module from `matplotlib` and use it to plot a bar chart of the means of all the columns in the `X` `DataFrame`

```
import matplotlib.pyplot as plt

X.mean().plot(kind='bar')

plt.show()
```


<hr style="width:30%;margin-left:0;"> 

There are many different types of plots we can do with pandas and matplotlib.

For instance, we could draw a scatter plot of BILL_AMT5 against BILL_AMT4 using the following command:

```
df2.plot(kind='scatter',x='BILL_AMT4',y='BILL_AMT5',color='red')

plt.show()
```


---
## Making Predictions with `sklearn`

Python's main machine learning library is `scikit-learn` (imported as `sklearn`)
- It is free and open source
- It features various classification, regression and clustering algorithms


### Training and Test Datasets

- Let's split our data into **training** and **test** datasets
- We will use the `train_test_split` function from `sklearn.model_selection` for this purpose


```
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 1, stratify = y)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
```
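
The `stratify` argument keeps the class proportions the same in both parts of the split. The sketch below checks this on a made-up array of 100 labels, of which 20% are ones; `X_toy` and `y_toy` are hypothetical stand-ins for `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 observations, 20% belonging to class 1
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.4, random_state=1, stratify=y_toy
)

print(X_tr.shape, X_te.shape)       # 60% of rows for training, 40% for testing
print(y_tr.mean(), y_te.mean())     # share of class 1 in each part
```

Without `stratify`, the share of defaults could differ between the two sets purely by chance, which matters most when one class is rare.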



<hr style="width:30%;margin-left:0;"> 

### Scaling Data

Standardization of features can help optimization algorithms train classifiers 

- `scikit-learn` has a `preprocessing` module, which contains a number of classes used for standardization
- For now we will use the `StandardScaler` class 
- Scale the data by doing the following transformation $X\sim(\mu, \sigma) \rightarrow Z\sim(0, 1)$, i.e. $z = (x - \mu)/\sigma$ for each feature

```
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
sc.fit(X_train)

X_train_scaled = sc.transform(X_train)
X_test_scaled = sc.transform(X_test)

```
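
To make the transformation concrete, the sketch below standardizes a small made-up array by hand and compares the result with `StandardScaler`. The arrays `train` and `test` are hypothetical. The mean and standard deviation are computed from the training data only and then reused for the test data, mirroring the `fit` and `transform` calls above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test features (two columns on very different scales)
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
test = np.array([[2.5, 250.0]])

mu = train.mean(axis=0)        # column means from the training data
sigma = train.std(axis=0)      # population standard deviation (ddof=0), as StandardScaler uses

train_manual = (train - mu) / sigma
test_manual = (test - mu) / sigma    # the test data reuses the training mu and sigma

sc_demo = StandardScaler().fit(train)
print(np.allclose(train_manual, sc_demo.transform(train)))
print(np.allclose(test_manual, sc_demo.transform(test)))
```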


<hr style="width:30%;margin-left:0;"> 

### Training the Classifier
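
The cells that follow call `lr.score`, `lr.predict_proba` and `lr.predict`, so a fitted classifier named `lr` is needed at this point. The sketch below assumes that classifier is a logistic regression, which is a guess based on the variable name. It runs on synthetic data from `make_classification` so that it is self-contained, and every name ending in `_demo` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit card data (assumption: binary target, numeric features)
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=1, stratify=y_demo
)

# Scale using statistics estimated on the training set only
sc_demo = StandardScaler().fit(X_tr)
X_tr_s = sc_demo.transform(X_tr)
X_te_s = sc_demo.transform(X_te)

# Fit the classifier on the scaled training data
lr_demo = LogisticRegression()
lr_demo.fit(X_tr_s, y_tr)

print(f'Training Set Accuracy = {lr_demo.score(X_tr_s, y_tr):.3f}')
print(f'Test Set Accuracy = {lr_demo.score(X_te_s, y_te):.3f}')
```

With the real data, the analogous calls under the same assumption would be `lr = LogisticRegression()` followed by `lr.fit(X_train_scaled, y_train)`.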

<hr style="width:30%;margin-left:0;"> 

### Compute Accuracies

```
print(f'Training Set Accuracy = {lr.score(X_train_scaled, y_train):.3f}')
print(f'Test Set Accuracy = {lr.score(X_test_scaled, y_test):.3f}')

```


<hr style="width:30%;margin-left:0;"> 

### Compute Predictions

- Let's compute the predicted probabilities and class labels for the test-set customers at positions 10, 11, ..., 19 (the slice `10:20` excludes 20)

```
print(lr.predict_proba(X_test_scaled[10:20, :]))

print(lr.predict(X_test_scaled[10:20, :]))
```
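
The two outputs are linked: `predict_proba` returns one probability per class for each customer, and `predict` returns the class with the larger probability. The sketch below checks this on synthetic data, assuming a logistic regression classifier; `clf_demo` and the data are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (stand-in for the scaled features)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=1)
clf_demo = LogisticRegression().fit(X_demo, y_demo)

proba = clf_demo.predict_proba(X_demo[10:20, :])   # one row per observation, one column per class
labels = clf_demo.predict(X_demo[10:20, :])        # predicted class labels for the same rows

# Each row of probabilities sums to 1
print(proba.sum(axis=1))

# Picking the most probable class reproduces the output of predict()
labels_from_proba = clf_demo.classes_[np.argmax(proba, axis=1)]
print(np.array_equal(labels, labels_from_proba))
```

The columns of `proba` follow the order of `clf_demo.classes_`, so the second column here is the estimated probability of class 1.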