# 3. `numpy` and `pandas`

This notebook follows Chapter 10 in the [Python Workshop textbook](https://search.ebscohost.com/login.aspx?direct=true&db=edsool&AN=edsool.9781804610619&site=eds-live&scope=site&authtype=shib&custid=s8516548). An electronic version of this book is freely available from the library after logging in with TAMU credentials!

Together, pandas and NumPy are the core Python tools for working with large tabular and numerical datasets. They are built for speed, efficiency, readability, and ease of use.

**Pandas** provides a framework for viewing and modifying data. It handles common data-related tasks such as creating DataFrames, importing data, scraping data from the web, merging data, pivoting, concatenating, and more.

**NumPy**, short for Numerical Python, is more focused on computation. NumPy interprets the rows and columns of pandas DataFrames as matrices in the form of NumPy arrays. When computing descriptive statistics such as the mean, median, mode, and quartiles, NumPy is blazingly fast.

## 3.1 NumPy and basic stats

NumPy is designed to handle big data swiftly. It includes the following essential components according to the NumPy documentation: 

- A powerful n-dimensional array object 
- Sophisticated (broadcasting) functions 
- Tools for integrating C/C++ and Fortran code 
- Useful linear algebra, Fourier transform, and random number capabilities 

From NumPy documentation: The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations. There are, however, cases where broadcasting is a bad idea because it leads to inefficient use of memory that slows computation.
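As a minimal sketch of the broadcasting rule described above, the example below adds a length-4 vector to a 3x4 matrix. The names `matrix`, `row`, `shifted`, and `doubled` and their values are made up for illustration.

```python
import numpy as np

# Hypothetical example values: a (3, 4) matrix and a length-4 vector
matrix = np.arange(12).reshape(3, 4)
row = np.array([10, 20, 30, 40])

# The (4,) vector is broadcast across each of the 3 rows; no Python loop is needed
shifted = matrix + row
print(shifted)

# A scalar is broadcast to every element
doubled = matrix * 2
print(doubled)
```

The shapes are compatible here because the trailing dimension of `matrix` (4) matches the length of `row`.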

Going forward, instead of using lists, you will use NumPy arrays. NumPy arrays are the basic elements of the NumPy package. NumPy arrays are designed to handle arrays of any dimension. 

NumPy arrays can be indexed easily and can hold many types of data, such as float, int, string, and object, but within a single array the **types must be consistent** to improve speed.
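One way to see the single-type rule in action is to inspect the `dtype` attribute of a few small arrays. This is only a sketch; the array names and values are hypothetical.

```python
import numpy as np

# Hypothetical example arrays
ints = np.array([1, 2, 3])          # all integers
mixed = np.array([1, 2.5, 3])       # integers are upcast to float
texty = np.array([1, "two", 3.0])   # everything is converted to a string

# Each array has exactly one dtype, whatever the inputs were
print(ints.dtype, mixed.dtype, texty.dtype)
```

When the inputs disagree, NumPy picks one common type for the whole array rather than storing each element with its own type.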


In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import statsmodels.api as sm
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

### 3.1.1 Exercise 128: Converting Lists to NumPy Arrays

In [None]:
test_scores = [70,65,95,88]
type(test_scores)

**Note** Now that `numpy` has been imported as `np`, you can access all of its functions, including the one that creates arrays. Type `np.` + Tab on your keyboard to see the breadth of options. You are looking for `array`.
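A possible version of the conversion, assuming `np.array` is the function the note hints at. The variable name `scores` is an assumption, not part of the exercise.

```python
import numpy as np

test_scores = [70, 65, 95, 88]

# Convert the list to a NumPy array (the name `scores` is an assumption)
scores = np.array(test_scores)

print(type(test_scores))  # the original Python list
print(type(scores))       # the new NumPy array
print(scores.dtype)       # an integer dtype; the exact width depends on the platform
```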

### 3.1.2 Exercises 129-131: Summary statistics

**Note** Unlike `mean`, `median` is not a method of a NumPy array; call it as the module-level function `np.median(...)`. (The mean may be computed in the same way, with `np.mean(...)`.)
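The difference between array methods and module-level functions can be sketched as follows, reusing the test scores from the previous exercise. The array name `scores` is an assumption.

```python
import numpy as np

# Assumed name for the array built from the test scores
scores = np.array([70, 65, 95, 88])

mean_method = scores.mean()          # mean as an array method
mean_function = np.mean(scores)      # mean as a numpy function
median_function = np.median(scores)  # median exists only as a numpy function

print(mean_method, mean_function, median_function)
print(scores.max(), scores.min(), scores.std())
```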


## 3.2 Matrices

A DataFrame is generally composed of rows, and each row has the same number of columns. From one point of view, it's a two-dimensional grid containing lots of numbers. It can also be interpreted as a list of lists, or an array of arrays.

NumPy has methods for creating matrices or n-dimensional arrays. One option is to place random numbers between 0 and 1 into each entry, as follows.

### 3.2.1 Exercise 132: Matrices

**Note** Calling `np.random.seed()` ensures that the same collection of random numbers is drawn every time, which is important for reproducibility. You can set your own seed.
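The sketch below shows one way to build a seeded random matrix and index into it. The 5x5 size and the name `random_square` are assumptions chosen for illustration.

```python
import numpy as np

np.random.seed(seed=60)                # any integer seed works
random_square = np.random.rand(5, 5)   # assumed 5x5 matrix of values in [0, 1)

first_row = random_square[0]
first_col = random_square[:, 0]
first_entry = random_square[0, 0]
third_row_fourth_col = random_square[2, 3]   # indices start at 0
block = random_square[1:3, 2:4]              # rows 1-2 and columns 2-3

matrix_mean = random_square.mean()
first_row_mean = random_square[0].mean()
last_col_mean = random_square[:, -1].mean()

# Resetting the same seed reproduces the same matrix
np.random.seed(seed=60)
again = np.random.rand(5, 5)
print((random_square == again).all())
```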

In [None]:
#Indexing, slicing, and accessing

#First row

#First column

#First entry

#Third row, fourth column

#Multiple rows and columns


In [None]:
#Matrix mean

#First row mean

#Last column mean


### 3.2.2 Computation time for large matrices

Now that you have gotten the hang of creating random matrices, you can see how long it takes to generate a large matrix and compute the mean:

In [None]:
%%time 
np.random.seed(seed=60) 
big_matrix = np.random.rand(100000, 100)
big_matrix.mean()

In the next exercise, you will create arrays using NumPy and compute various values with them. The object you will be working with is `ndarray`: `numpy.ndarray` is a (usually fixed-size) multidimensional array container of items of the same type and size.

### 3.2.3 Exercise 133: Creating an array to implement NumPy computations

In [None]:
# np.arange returns evenly spaced values 
# within a given interval.


In [None]:
#reshape to 20 rows and 5 cols


**Note** `.reshape` fills the new matrix by rows (not by columns)
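To make the row-wise fill concrete, here is a small sketch. It assumes the array holds the integers 1 to 100; the names `mat1`, `mat2`, and `gram` are hypothetical.

```python
import numpy as np

mat1 = np.arange(1, 101)      # integers 1..100 (the stop value is excluded)
mat2 = mat1.reshape(20, 5)    # 20 rows, 5 columns, filled row by row

print(mat2[0])      # first row holds the first five values
print(mat2[:3, 0])  # first column jumps by 5 from row to row

col_means = mat2.mean(axis=0)   # one mean per column
row_means = mat2.mean(axis=1)   # one mean per row

# For 2-d arrays, np.dot and the @ operator both perform matrix multiplication
gram = mat2.T @ mat2
print(np.array_equal(gram, np.dot(mat2.T, mat2)))
```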

In [None]:
#matrix computations

#dot product of two arrays (for 2-d arrays, is same as matrix multiplication)
#matrix multiplication

In [None]:
#dimension-specific computations


#column means
#row means

#column std
#row std

#multi-dimensional
mat4=np.arange(1, 13).reshape(2,2,3)
mat4

np.mean(mat4,axis=0) 
np.mean(mat4,axis=1)
np.mean(mat4,axis=2) 
np.mean(mat4,axis=(0,1))

## 3.3 The pandas library

Pandas is the Python library that handles data on all fronts. Pandas can import data, read data, and display data in an object called a `DataFrame`. A DataFrame consists of rows and columns. One way to get a feel for DataFrames is to create one.

### 3.3.1 Exercise 134: Using DataFrames to Manipulate Stored Student testscore Data

In this exercise, you will create a dictionary, which is one of many ways to create a pandas DataFrame. You will then manipulate this data as required.

In [None]:
# create dictionary of test scores
test_dict = {'Corey':[63,75,88], 
             'Kevin':[48,98,92], 
             'Akshay': [87, 86, 85]}
print(test_dict)

# create dataframe

You can inspect the DataFrame: 

 - First, each dictionary key is listed as a column. 
 - Second, the rows are labeled with indices starting with 0 by default. 
 - Third, the visual layout is clear and legible.

Each column of a DataFrame is a `Series`, which is a one-dimensional labeled array.

Next, you will rotate the DataFrame. This operation is called a transpose and is available as a standard `pandas` method. A transpose turns rows into columns and columns into rows.
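A minimal sketch of creating the DataFrame from the dictionary and transposing it. The names `df` and `df_t` and the `Quiz_1` to `Quiz_3` column labels are assumptions.

```python
import pandas as pd

test_dict = {'Corey': [63, 75, 88],
             'Kevin': [48, 98, 92],
             'Akshay': [87, 86, 85]}

df = pd.DataFrame(test_dict)   # dictionary keys become columns; rows are indexed 0, 1, 2
df_t = df.T                    # transpose: students become rows

# Assumed column labels for the three quizzes
df_t.columns = ['Quiz_1', 'Quiz_2', 'Quiz_3']
print(df_t)

print(type(df['Corey']))       # a single column is a Series
```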

### 3.3.2 Exercise 135: DataFrame Computations with the Student testscore Data

Now, select a range of values from specific rows and columns. You will use `.iloc`, the pandas DataFrame indexer that selects rows and columns by integer position. This is shown in the following step:
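The selections in this and the next exercise might look like the sketch below. It assumes a DataFrame with students as rows and hypothetical `Quiz_1` to `Quiz_3` columns.

```python
import pandas as pd

# Assumed layout: students as rows, quizzes as columns
df = pd.DataFrame(
    {'Quiz_1': [63, 48, 87], 'Quiz_2': [75, 98, 86], 'Quiz_3': [88, 92, 85]},
    index=['Corey', 'Kevin', 'Akshay'])

first_row = df.iloc[0]       # first row by position (Corey)
quiz_1 = df['Quiz_1']        # column by name

# First 2 rows and last 2 columns, selected by position and then by label
subset = df.iloc[0:2, 1:3]
same_subset = df.loc[['Corey', 'Kevin'], ['Quiz_2', 'Quiz_3']]
print(subset)
```

Note that `.iloc` slices exclude the end position, so `0:2` selects positions 0 and 1.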

In [None]:
#access first row by index number


In [None]:
#access column by name


### 3.3.3 Exercise 136: Computing DataFrames within DataFrames

In [None]:
# Defining a new DataFrame from first 2 rows and last 2 columns 


In [None]:
# Select first 2 rows and last 2 columns using index numbers 


Now, add a new column to find the quiz average of our students. You can generate new columns in a variety of ways. One way is to use available methods such as the mean. In pandas, it's important to specify the axis: `axis=0` computes down the rows and returns one value per column, while `axis=1` computes across the columns and returns one value per row.
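The effect of the `axis` argument can be checked on a small example. The column names, the `Quiz_4` scores, and the `Quiz_Avg` label below are all hypothetical.

```python
import pandas as pd

# Assumed layout: students as rows, quizzes as columns
df = pd.DataFrame(
    {'Quiz_1': [63, 48, 87], 'Quiz_2': [75, 98, 86], 'Quiz_3': [88, 92, 85]},
    index=['Corey', 'Kevin', 'Akshay'])

col_means = df.mean(axis=0)   # one mean per quiz (per column)
row_means = df.mean(axis=1)   # one mean per student (per row)

df['Quiz_Avg'] = row_means           # new column from a computed Series
df['Quiz_4'] = [92, 95, 88]          # new column from a list (made-up scores)
df = df.drop(columns='Quiz_Avg')     # delete a column
print(df)
```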

In [None]:
# Define new column as mean of other columns 


In [None]:
# Create a new column as a list


In [None]:
# Delete column


In the next section, you will be looking at new rows and NaN, which is an official NumPy term.

### 3.3.4 New rows and NaN

It's not easy to add new rows to a pandas DataFrame. A common strategy is to generate a new DataFrame and then concatenate the two. Say you have a new student who joins the class for the fourth quiz. What values should you put for the other three quizzes? The answer is `nan`, which stands for "not a number". `nan` is an official NumPy term that is accessed as `np.nan` (it is case-sensitive). In later exercises, you will look at how `nan` can be used. In the next exercise, you will look at concatenating and working with null values.

Much more can be said on this.  **For more details**, see 
 - https://pandas.pydata.org/pandas-docs/dev/user_guide/gotchas.html#nan-integer-na-values-and-na-type-promotions
 - https://pandas.pydata.org/docs/user_guide/missing_data.html
 - https://stackoverflow.com/questions/60115806/pd-na-vs-np-nan-for-pandas
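The concatenation strategy described above can be sketched as follows. The student name `Adrian`, the scores, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Assumed existing scores for four quizzes
df = pd.DataFrame(
    {'Quiz_1': [63, 48, 87], 'Quiz_2': [75, 98, 86],
     'Quiz_3': [88, 92, 85], 'Quiz_4': [92, 95, 88]},
    index=['Corey', 'Kevin', 'Akshay'])

# Hypothetical new student who only took the fourth quiz
df_new = pd.DataFrame(
    {'Quiz_1': [np.nan], 'Quiz_2': [np.nan], 'Quiz_3': [np.nan], 'Quiz_4': [71]},
    index=['Adrian'])

combined = pd.concat([df, df_new])
print(combined)
print(combined.dtypes)          # columns that contain NaN become float

student_means = combined.mean(axis=1)   # NaN values are skipped by default
print(student_means)
```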

### 3.3.5 Exercise 137: Concatenating and Finding the Mean with Null Values for Our testscore Data



In [None]:
# Create new DataFrame of one row 

# Concatenate DataFrames 


Notice that all values are floats except for **Quiz_4**. There will be occasions when you need to cast all values in a particular column as another type.

### 3.3.6 Casting column types
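One common way to cast a column is the `astype` method. The sketch below uses made-up values and is only one possible approach.

```python
import pandas as pd

# Hypothetical data: one float column and one integer column
df = pd.DataFrame({'Quiz_3': [88.0, 92.0, 85.0], 'Quiz_4': [92, 95, 71]})
print(df.dtypes)

# Cast the integer column to float so that all columns share one type
df['Quiz_4'] = df['Quiz_4'].astype(float)
print(df.dtypes)
```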

## 3.4 Data

Now that you have been introduced to NumPy and pandas, you will use them to analyze some real data. Data scientists analyze data that exists in the cloud or online. One strategy is to download data directly to your computer.
 
 
It is recommended to create a new folder to store all of your data. You can open your Jupyter Notebook in this same folder.

### 3.4.1 Downloading data

Data comes in many formats, and pandas is equipped to handle most of them. In general, when looking for data to analyze, it's worth searching the keyword "dataset." A dataset is a collection of data. Online, "data" is everywhere, whereas datasets contain data in its raw format. You will start by examining the famous Boston Housing dataset from 1980, which is available on the textbook's GitHub repository at https://packt.live/31Cd96j. Begin by downloading the dataset onto your system.

### 3.4.2 Reading data

Here is a list of standard data files that pandas will read, along with the code for reading data:

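- CSV files: `pd.read_csv('file_name')`
- Excel files: `pd.read_excel('file_name')`
- Feather files: `pd.read_feather('file_name')`
- HTML tables: `pd.read_html('file_name')` (returns a list of DataFrames, one per table found)
- JSON files: `pd.read_json('file_name')`
- SQL databases: `pd.read_sql(query, connection)` (takes a SQL query or table name plus a database connection, not a file name)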

If the files are clean, pandas will read them properly. Sometimes, files are not clean, and changing function parameters may be required. It's advisable to copy any errors and search for solutions online. A further point of consideration is that the data should be read into a DataFrame. Pandas converts the data into a DataFrame upon reading it, but you need to assign that DataFrame to a variable to keep it.
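To try `pd.read_csv` without a file on disk, the sketch below reads a small CSV held in memory. The column names echo the housing data, but the rows (including the blank `RM` entry) are made up for illustration.

```python
import io
import pandas as pd

# Stand-in for a CSV file: made-up rows, with one blank RM value
csv_text = "CRIM,RM,MEDV\n0.00632,6.575,24.0\n0.02731,,21.6\n"

# The result must be assigned to a variable, or the DataFrame is discarded
sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df.head())
print(sample_df.isnull().sum())   # the blank field is read as NaN
```

Passing a path such as `'HousingData.csv'` in place of the `StringIO` object reads from disk in the same way.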

### 3.4.3 Exercise 138: Reading and viewing the Boston Housing dataset

In [None]:
housing_df = pd.read_csv('HousingData.csv')
housing_df.head()

Data description can be found at this link: https://search.r-project.org/CRAN/refmans/mlbench/html/BostonHousing.html

### 3.4.4 Exercise 139: Gaining data insights from the Boston Housing dataset

The `shape` attribute confirms that you have 506 rows and 14 columns. Notice that `shape` does not have any parentheses after it. This is because it is an attribute of the DataFrame rather than a method.
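The difference between attributes and methods can be seen on a small stand-in DataFrame. The values below are hypothetical and are not taken from the housing data.

```python
import numpy as np
import pandas as pd

# Small stand-in for housing_df (hypothetical values)
sample_df = pd.DataFrame({'CRIM': [0.006, 0.027, np.nan],
                          'RM': [6.575, 6.421, 7.185]})

print(sample_df.shape)    # attribute: no parentheses
sample_df.info()          # method: parentheses required; shows non-null counts and dtypes
summary = sample_df.describe()   # method: descriptive statistics per column
print(summary)
```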

### 3.4.5 Null values

You need to do something about the null values. There are several popular choices when dealing with null values: 

- Eliminate the rows: Can work if null values are a very small percentage, such as 1% of the total dataset. 
- Replace missing values with the mean/median/mode and add a missing indicator (for use in downstream modeling efforts)
- Impute missing values: Depends on the reason for missingness.  Can use other fields to impute missing values in a given field if it is reasonable to assume that missingness can be "explained" by other **observed** values.  This is not always the case.
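The first two options in the list above can be sketched on a small made-up DataFrame. The `AGE_missing` indicator column name is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one null in each column
sample_df = pd.DataFrame({'AGE': [65.2, np.nan, 61.1, 45.8],
                          'RM': [6.575, 6.421, np.nan, 6.998]})

# Option 1: eliminate every row that contains a null value
dropped = sample_df.dropna()

# Option 2: record where values were missing, then fill with the median
filled = sample_df.copy()
filled['AGE_missing'] = filled['AGE'].isnull().astype(int)   # 1 where AGE was null
filled['AGE'] = filled['AGE'].fillna(filled['AGE'].median())
print(dropped)
print(filled)
```

The indicator column must be created before filling, since the null positions are lost afterwards.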


### 3.4.6 Exercise 140: Null value operations on a dataset

Breakdown of the code for this exercise:
- `housing_df` is the DataFrame. 
- `.loc` allows you to specify rows and columns. 
- `:` selects all rows. 
- `housing_df.isnull().any()` selects only columns with null values.
- `.describe()` pulls up the statistics.
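One expression consistent with the pieces listed above is sketched below on a small stand-in DataFrame. The values are hypothetical, and `sample_df` stands in for `housing_df`.

```python
import numpy as np
import pandas as pd

# Small stand-in for housing_df (hypothetical values)
sample_df = pd.DataFrame({'CRIM': [0.006, np.nan, 0.027],
                          'RM': [6.575, 6.421, 7.185],
                          'AGE': [65.2, 78.9, np.nan]})

null_cols = sample_df.isnull().any()   # boolean Series: True for columns with any null
print(null_cols)

# All rows (:), only the columns flagged True, then summary statistics
summary = sample_df.loc[:, null_cols].describe()
print(summary)
```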

### 3.4.7 Replacing null values

Pandas includes a nice method, `fillna`, which can be used to replace null values. It works for individual columns and entire DataFrames. You will use three approaches:

- replacing the null values of a column with the mean
- replacing the null values of a column with another value
- replacing all the null values in the entire dataset with the median. 
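The three approaches could look like this on a small made-up DataFrame. The column names echo the housing data, but the values and the choice of which column gets which treatment are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one null per column
sample_df = pd.DataFrame({'AGE': [60.0, np.nan, 80.0],
                          'CHAS': [0.0, np.nan, 1.0],
                          'RM': [6.0, 7.0, np.nan]})

# 1. Replace the nulls of one column with that column's mean
sample_df['AGE'] = sample_df['AGE'].fillna(sample_df['AGE'].mean())

# 2. Replace the nulls of one column with another chosen value
sample_df['CHAS'] = sample_df['CHAS'].fillna(0)

# 3. Replace all remaining nulls with each column's median
sample_df = sample_df.fillna(sample_df.median())
print(sample_df)
```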

In [None]:
# replacing with the mean

# replacing with another value

# replacing with median



After eliminating all null values, the dataset is much cleaner. There may also be unrealistic outliers or extreme outliers that will lead to poor prediction. These can often be detected through visual analysis, which you will be covering in the next section.

## 3.5 Visualization

In [None]:
# Set up seaborn dark grid
sns.set()

You began this introduction to data analysis with **NumPy**, Python's fast library for large matrix computations. Next, you learned the fundamentals of **pandas**, Python's library for handling DataFrames. You then used NumPy and pandas together to analyze the Boston Housing dataset, applying descriptive statistical methods and the Matplotlib and Seaborn graphing libraries. You also learned about advanced methods for creating clean, clearly labeled, publishable graphs.

## 3.6 Fun example: Broadcasting is not always faster

From https://numpy.org/doc/stable/user/basics.broadcasting.html

"Broadcasting is a powerful tool for writing short and usually intuitive code that does its computations very efficiently in C. However, there are cases when broadcasting uses unnecessarily large amounts of memory for a particular algorithm. In these cases, it is better to write the algorithm’s outer loop in Python. This may also produce more readable code, as algorithms that use broadcasting tend to become more difficult to interpret as the number of dimensions in the broadcast increases."

See the following example and explanation from https://stackoverflow.com/questions/49632993/why-python-broadcasting-in-the-example-below-is-slower-than-a-simple-loop.
 

In [None]:
# Function to take the sum of squares of each row after row-wise subtraction

def norm_loop(M, v):
    n = M.shape[0]
    d = np.zeros(n)
    for i in range(n):
        d[i] = np.sum((M[i] - v)**2)
    return d

def norm_bcast(M, v):
    return np.sum((M - v)**2, axis=1)

#broadcasting is better in this instance for smaller datasets
M = np.random.random_sample((10, 100))
v = M[0]
%timeit norm_loop(M, v) 
%timeit norm_bcast(M, v)

#bigger datasets tell a different story
M = np.random.random_sample((1000, 10000))
v = M[0]
%timeit norm_loop(M, v) 
%timeit norm_bcast(M, v)

What gives? It comes down to **memory access**. 

In the broadcast version, `v` is subtracted from every row of `M` in a single pass over the whole matrix. By the time the last row of `M` is processed, the results of processing the first row have been evicted from cache, so for the second step, these differences are again loaded into cache memory and squared. Finally, they are loaded and processed a third time for the summation. Since `M` is quite large, parts of the cache are cleared on each step to accommodate all of the data.

In the looped version, each row is processed completely in one smaller step, leading to fewer cache misses (i.e., failures to retrieve needed data from cache because it has been cleared and must be reloaded) and overall faster code.
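Outside a notebook, where `%timeit` is unavailable, the comparison can be sketched with `time.perf_counter`. The matrix size below is an arbitrary assumption, and timings vary by machine, so no particular winner is claimed here. The sketch also shows that the broadcast version builds a temporary array as large as `M`, while the loop only needs one row at a time.

```python
import time
import numpy as np

def norm_loop(M, v):
    # Process one row at a time
    n = M.shape[0]
    d = np.zeros(n)
    for i in range(n):
        d[i] = np.sum((M[i] - v)**2)
    return d

def norm_bcast(M, v):
    # Process the whole matrix at once via broadcasting
    return np.sum((M - v)**2, axis=1)

rng = np.random.default_rng(0)
M = rng.random((200, 500))    # arbitrary size chosen for illustration
v = M[0]

# Both versions compute the same result
same = np.allclose(norm_loop(M, v), norm_bcast(M, v))

# Size of the temporary created by the subtraction in each version
bcast_temp_bytes = (M - v).nbytes    # as large as M itself
loop_temp_bytes = (M[0] - v).nbytes  # a single row

start = time.perf_counter()
norm_loop(M, v)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
norm_bcast(M, v)
bcast_seconds = time.perf_counter() - start

print(same, bcast_temp_bytes, loop_temp_bytes)
print(loop_seconds, bcast_seconds)
```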