# Basics of Numpy and Pandas


---

This notebook covers the basics of the two most important Python libraries for data analytics and statistical modeling: `Numpy` and `Pandas`.

### Numpy

---

* Numpy array - from list, special functions
* Array operations
* 2-D arrays
* Indexing and slicing
* Conditional subsetting
* Array-array operations

### Pandas

---

* Pandas series
* DataFrame - creation, read from files
* Quick checking DataFrame
* Descriptive stats on DataFrame
* Indexing, slicing, conditional subsetting
* Operations on specific rows/columns

## Numpy array from a Python list
Numpy arrays behave like **true numerical vectors**, not ordinary lists. That is why they are used for mathematical operations and machine learning algorithms, and as the basis of the Pandas DataFrame for data analytics.

In [1]:
import numpy as np
lst1=[1,2,3]
array1 = np.array(lst1)
print(lst1)

[1, 2, 3]


In [3]:
print(type(lst1))

<class 'list'>


In [4]:
print(type(array1))

<class 'numpy.ndarray'>


In [None]:
a = 2

type(a)

print(type(a)) 

In [None]:
import pandas as pd 

df=pd.read_csv('wine.data.csv')

print(df)

print(df.describe())

In [None]:
import pandas as pd 

df5=pd.read_excel('HW.xlsx')

print(df5)

print(df5.describe())

In [None]:
lst2=[10,11,12]
array2 = np.array(lst2)
print(array2)

In [None]:
print(f"Adding two lists {lst1} and {lst2} together: {lst1+lst2}")

In [None]:
print(f"Adding two numpy arrays {array1} and {array2} together: {array1+array2}")
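
The two cells above show the key difference between lists and arrays. A minimal, self-contained sketch of the same idea follows; the names `list_a`, `arr_a`, and so on are chosen here for illustration. `+` concatenates lists but adds arrays element by element, and `*` with an integer repeats a list but scales an array.

```python
import numpy as np

list_a, list_b = [1, 2, 3], [10, 11, 12]
arr_a, arr_b = np.array(list_a), np.array(list_b)

list_sum = list_a + list_b    # lists: concatenation
arr_sum = arr_a + arr_b       # arrays: element-wise addition
list_twice = list_a * 2       # lists: repetition
arr_twice = arr_a * 2         # arrays: every element is doubled

print(list_sum)            # [1, 2, 3, 10, 11, 12]
print(arr_sum.tolist())    # [11, 13, 15]
print(list_twice)          # [1, 2, 3, 1, 2, 3]
print(arr_twice.tolist())  # [2, 4, 6]
```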

## Mathematical operations with/on Numpy arrays

In [None]:
print("array2 multiplied by array1: ",array1*array2)

In [None]:
print("array2 divided by array1: ",array2/array1)

In [None]:
print("array2 raised to the power of array1: ",array2**array1)

In [None]:
# sine function
print("Sine: ",np.sin(array1))

In [None]:
# logarithm

print("Natural logarithm: ",np.log(array1))

In [None]:
print("Base-10 logarithm: ",np.log10(array1))

In [None]:
print("Base-2 logarithm: ",np.log2(array1))

In [None]:
# Exponential
print("Exponential: ",np.exp(array1))

## How to generate arrays easily?
* `np.zeros`
* `np.ones`
* `np.arange`
* `np.linspace`

In [None]:
print("A series of zeroes:",np.zeros(7))

In [None]:
print("A series of ones:",np.ones(9))

In [None]:
print("A series of numbers:",np.arange(5,16))

In [None]:
print("Numbers spaced apart by 2:",np.arange(0,11,2))

In [None]:
print("Numbers spaced apart by float:",np.arange(0,11,2.5))

In [None]:
print("Every 5th number from 30 in reverse order: ",np.arange(30,-1,-5))

In [None]:
print("11 linearly spaced numbers between 1 and 5: ",np.linspace(1,5,11))
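
One detail is easy to miss in the cells above: `np.arange` stops before its end value, while `np.linspace` includes both endpoints by default. A short sketch with illustrative numbers:

```python
import numpy as np

half_open = np.arange(0, 10, 2.5)   # the stop value 10 is excluded
with_end = np.linspace(0, 10, 5)    # 5 points, both endpoints included
spacing = with_end[1] - with_end[0]

print(half_open.tolist())  # [0.0, 2.5, 5.0, 7.5]
print(with_end.tolist())   # [0.0, 2.5, 5.0, 7.5, 10.0]
print(spacing)             # 2.5
```

`np.arange` is convenient when the step is known, and `np.linspace` when the number of points is known.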

## Multi-dimensional arrays

In [None]:
my_mat = [[1,2,3],[4,5,6],[7,8,9]]

print("\n", my_mat)

mat = np.array(my_mat)

print("\n",mat)

print("\n Type/Class of this object:",type(mat))

print("\n Here is the matrix\n----------\n",mat,"\n----------")

In [None]:
my_tuple = [(1.5,2,3), (4,5,6)]  # a plain list of tuples; converted to an array below

mat_tuple = np.array(my_tuple)

print (mat_tuple)

## Dimension, shape, size, and data type of the 2D array

In [None]:
print("\n Dimension of this matrix: ",mat.ndim,sep='') 

In [None]:
print("\n Size of this matrix: ", mat.size,sep='') 

In [None]:
print("\n Shape of this matrix: ", mat.shape,sep='')

In [None]:
print("\n Data type of this matrix: ", mat.dtype,sep='')
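
For reference, here is what these attributes return for a fixed 3x3 matrix, and how a single float changes the data type of the whole array. This is a small sketch; `int_mat` and `mixed` are names chosen for illustration.

```python
import numpy as np

int_mat = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
mixed = np.array([(1.5, 2, 3), (4, 5, 6)])  # one float among the integers

print(int_mat.ndim, int_mat.shape, int_mat.size)  # 2 (3, 3) 9
print(np.issubdtype(int_mat.dtype, np.integer))   # True (exact integer width depends on the platform)
print(mixed.shape, mixed.dtype)                   # (2, 3) float64
```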

## Zeros, Ones, Random, and Identity Matrices and Vectors

In [None]:
print("Vector of zeros: ",np.zeros(5))

In [None]:
print("\n Matrix of zeros: \n",np.zeros((3,4)))

In [None]:
print("\n Vector of ones: ",np.ones(4))

In [None]:
print("\n Matrix of ones: ",np.ones((4,2)))

In [None]:
print("\n Matrix of 5's: ",5*np.ones((3,3)))

In [None]:
print("\n Identity matrix of dimension 2:",np.eye(2))

In [None]:
print("\n Identity matrix of dimension 4:\n",np.eye(4))

In [None]:
print("\n Random matrix of shape (4,3):\n",np.random.randint(1,10,size=(4,3)))

## Reshaping, Ravel, Min, Max, Sorting

In [None]:
a = np.random.randint(1,100,30)
print(a)

In [None]:
b = a.reshape(2,3,5)
print("\n",b) 

In [None]:
c = a.reshape(1,3,10)
print("\n",c)

In [None]:
print ("\n Shape of a:\n", a.shape) 

print ("\n Shape of b:\n", b.shape) 

print ("\n Shape of c:\n ", c.shape)

In [None]:
print("\na looks like:\n",a)

In [None]:
print("\nb looks like:\n",b)

In [None]:
print("\nc looks like:\n",c)

In [None]:
b_flat = b.ravel()

print(b_flat)
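
The section title also mentions min, max, and sorting. A minimal sketch of those operations follows, using a small fixed array (`values`, made up for illustration) instead of random numbers so the results are reproducible.

```python
import numpy as np

values = np.array([42, 7, 19, 88, 3, 56])  # fixed values, chosen for illustration

print(values.min(), values.max())          # 3 88
print(values.argmin(), values.argmax())    # 4 3 (positions of the min and max)

sorted_copy = np.sort(values)              # returns a sorted copy; values is unchanged
print(sorted_copy.tolist())                # [3, 7, 19, 42, 56, 88]

grid = values.reshape(2, 3)                # [[42, 7, 19], [88, 3, 56]]
col_max = grid.max(axis=0)                 # maximum of each column
row_min = grid.min(axis=1)                 # minimum of each row
print(col_max.tolist())                    # [88, 7, 56]
print(row_min.tolist())                    # [7, 3]
```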

## Indexing and slicing

In [None]:
arr = np.arange(0,11)

print("Array:",arr)

In [None]:
print("\n Element at 7th index is:", arr[7])

In [None]:
print("\n Elements from 3rd to 5th index are:", arr[3:6])

In [None]:
print("\n Elements up to (but not including) the 4th index are:", arr[:4])

In [None]:
print("\n Elements from last backwards are:", arr[-1::-1])

In [None]:
print("\n 3 elements from last backwards, taking every 2nd one, are:", arr[-1:-6:-2])

In [None]:
arr2 = np.arange(0,21,2)

In [None]:
print("\n New array:",arr2)

In [None]:
print("\n Elements at 2nd, 4th, and 9th index are:", arr2[[2,4,9]]) # Pass a list as an index to subset

In [None]:
mat = np.random.randint(10,20,15).reshape(3,5)

print("\n Matrix of random 2-digit numbers\n",mat)

In [None]:
print("\n Double bracket indexing\n")
print("\n Element in row index 1 and column index 2:", mat[1][2])

In [None]:
print("\n Single bracket with comma indexing\n")
print("\n Element in row index 1 and column index 2:", mat[1,2])

In [None]:
print("\n Row or column extract\n")
print("\n Matrix of random 2-digit numbers\n",mat)
print("\n Entire row at index 2:", mat[2])
print("\n Entire column at index 3:", mat[:,3])

In [None]:
print("\n Matrix of random 2-digit numbers\n",mat)
print("\n Subsetting sub-matrices\n")
print("\n Matrix with row indices 1 and 2 and column indices 3 and 4\n", mat[1:3,3:5])
print("\n Matrix with row indices 0 and 1 and column indices 1 and 3\n", mat[0:2,[1,3]])
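
Because the matrix above is random, the printed results change on every run. The same indexing patterns on a fixed matrix (`grid`, a name chosen for illustration) make the expected output easy to check by eye:

```python
import numpy as np

grid = np.arange(15).reshape(3, 5)
# [[ 0  1  2  3  4]
#  [ 5  6  7  8  9]
#  [10 11 12 13 14]]

print(grid[1, 2])                  # 7, row index 1 and column index 2
print(grid[2].tolist())            # [10, 11, 12, 13, 14], entire row at index 2
print(grid[:, 3].tolist())         # [3, 8, 13], entire column at index 3
print(grid[1:3, 3:5].tolist())     # [[8, 9], [13, 14]]
print(grid[0:2, [1, 3]].tolist())  # [[1, 3], [6, 8]]
```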

## Conditional subsetting

In [None]:
mat = np.random.randint(10,100,15).reshape(3,5)

print("\n Matrix of random 2-digit numbers\n",mat)

In [None]:
print ("\n Elements greater than 50\n", mat[mat>50])

In [None]:
mat>50

In [None]:
mat*(mat>50)
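
The three cells above rely on a boolean mask. The sketch below repeats them on a small fixed matrix (`scores`, made up for illustration) and adds `np.where`, which replaces the masked-out entries with a value of your choice rather than zero.

```python
import numpy as np

scores = np.array([[12, 75, 48], [91, 50, 63]])
mask = scores > 50                       # boolean array of the same shape

selected = scores[mask]                  # 1-D array of the matching elements
zeroed = scores * mask                   # False counts as 0, True as 1
replaced = np.where(mask, scores, -1)    # keep matches, put -1 elsewhere

print(mask.tolist())      # [[False, True, False], [True, False, True]]
print(selected.tolist())  # [75, 91, 63]
print(zeroed.tolist())    # [[0, 75, 0], [91, 0, 63]]
print(replaced.tolist())  # [[-1, 75, -1], [91, -1, 63]]
```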

## Array operations (array-array, array-scalar, universal functions)

In [None]:
mat1 = np.random.randint(1,10,9).reshape(3,3)

mat2 = np.random.randint(1,10,9).reshape(3,3)

print("\n1st Matrix of random single-digit numbers\n",mat1)
print("\n2nd Matrix of random single-digit numbers\n",mat2)

In [None]:
print("\nAddition\n", mat1+mat2)

In [None]:
print("\nElement-wise multiplication\n", mat1*mat2)

In [None]:
print("\nDivision\n", mat1/mat2)

In [None]:
print("\n Linear combination: 3*A - 2*B\n", 3*mat1-2*mat2)

In [None]:
print("\n Addition of a scalar (100)\n", 100+mat1)

In [None]:
print("\n Exponentiation, each element cubed here\n", mat1**3)

In [None]:
print("\n Exponentiation, sq-root using pow function\n",pow(mat1,0.5))
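
All the operators above act element by element. In particular, `*` and `**` are not the matrix product or the matrix power. A short sketch of the difference, with two small fixed matrices `A` and `B` chosen for illustration:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

elementwise = A * B                       # multiplies matching entries
matrix_product = A @ B                    # rows of A times columns of B
squared_entries = A ** 2                  # squares each entry
matrix_square = np.linalg.matrix_power(A, 2)  # same as A @ A

print(elementwise.tolist())      # [[5, 12], [21, 32]]
print(matrix_product.tolist())   # [[19, 22], [43, 50]]
print(squared_entries.tolist())  # [[1, 4], [9, 16]]
print(matrix_square.tolist())    # [[7, 10], [15, 22]]
```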

## Pandas series

In [None]:
import pandas as pd

In [None]:
labels = ['a','b','c']

my_data = [10,20,30]

arr = np.array(my_data)

d = {'a':10,'b':20,'c':30}

print("Labels:", labels)
print("My data:", my_data)
print("Dictionary:", d)

In [None]:
s1=pd.Series(data=my_data)
s1

In [None]:
s2=pd.Series(data=my_data, index=labels)
print(s2)

In [None]:
s3=pd.Series(arr, labels)
print(s3)

In [None]:
s4=pd.Series(d)
print(s4)
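
Once a Series has labels, its values can be read by label or by position, and it supports the same vectorized operations as a Numpy array. A minimal sketch; `prices` is a name made up for illustration.

```python
import pandas as pd

prices = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

by_label = prices['b']             # label-based access
by_position = prices.iloc[1]       # position-based access
doubled = prices * 2               # element-wise, labels are kept
above_15 = prices[prices > 15]     # conditional subsetting

print(by_label, by_position)       # 20 20
print(doubled.tolist())            # [20, 40, 60]
print(above_15.index.tolist())     # ['b', 'c']
```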

## Pandas DataFrame

In [None]:
matrix_data = np.random.randint(1,20,size=20).reshape(5,4)
row_labels = ['A','B','C','D','E']
column_headings = ['W','X','Y','Z']

df = pd.DataFrame(data=matrix_data, index=row_labels, columns=column_headings)
print("\nThe data frame looks like\n",'-'*45, sep='')
print(df)

In [None]:
d={'a':[10,20],'b':[30,40],'c':[50,60]}

df2=pd.DataFrame(data=d,index=['X','Y'])

print(df2)

## A DataFrame can be created by reading directly from a CSV or an Excel file

Refer to this article, which I wrote for O'Reilly Media's Medium publication, to understand the various data sources that can be read directly into a Pandas DataFrame.

**[Read in the data in a Pandas DataFrame like an expert](https://medium.com/97-things/read-in-the-data-in-a-pandas-dataframe-like-an-expert-d03058edae98)**

In [None]:
df3 = pd.read_csv("wine.data.csv")

In [None]:
df3.head()

In [None]:
print(df3.head())

In [None]:
df4 = pd.read_excel("Height_Weight.xlsx")

In [None]:
df4
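
If the data files are not at hand, `pd.read_csv` can be tried on an in-memory string, since it accepts any file-like object. The sketch below assumes a three-column layout; the rows are for illustration only and are not claimed to match the contents of `wine.data.csv`.

```python
import io
import pandas as pd

# Illustrative rows only; the real file has more columns and rows.
csv_text = """Alcohol,Ash,Magnesium
14.23,2.43,127
13.20,2.14,100
13.16,2.67,101
"""

sample_df = pd.read_csv(io.StringIO(csv_text))  # the first line becomes the header

print(sample_df.shape)             # (3, 3)
print(sample_df.columns.tolist())  # ['Alcohol', 'Ash', 'Magnesium']
print(sample_df.head(2))
```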

## Quick checking DataFrames
* `.head()`
* `.tail()`
* `.sample()`
* `.info()`
* `.describe()`

In [None]:
df3.head()

In [None]:
df3.head(3)

In [None]:
df3.tail(7)

In [None]:
df3.sample(5)

In [None]:
df3.info()

In [None]:
df4.info()

In [None]:
df3.describe().transpose()

In [None]:
df4.describe()

## Basic descriptive statistics on a DataFrame
* `mean()`
* `std()`
* `var()`
* `min()` and `max()`

In [None]:
df3.mean()

In [None]:
df3.std()

In [None]:
df4.var(numeric_only=True)  # skip non-numeric columns such as 'Name'

In [None]:
df4.min()
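
When a DataFrame mixes text and numeric columns, passing `numeric_only=True` restricts the statistics to the numeric ones. A small sketch on a made-up table (`people` and its values are assumptions, not the contents of the Excel file):

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Ann', 'Bob', 'Cy'],
    'Height': [150, 160, 170],
    'Weight': [120, 140, 160],
})

means = people.mean(numeric_only=True)
variances = people.var(numeric_only=True)   # sample variance (divides by n - 1)
weight_range = people['Weight'].max() - people['Weight'].min()

print(means['Height'])      # 160.0
print(variances['Height'])  # 100.0
print(weight_range)         # 40
```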

## Indexing, slicing columns and rows of a DataFrame

In [None]:
print("\nThe 'Name' column\n",'-'*25, sep='')
print(df4['Name'])
print("\nType of the column: ", type(df4['Name']), sep='')
print("\nThe 'Name' and 'Weight' columns indexed by passing a list\n",'-'*55, sep='')
print(df4[['Name','Weight']])
print("\nType of the pair of columns: ", type(df4[['Name','Weight']]), sep='')

In [None]:
print("\nLabel-based 'loc' method can be used for selecting row(s)\n",'-'*60, sep='')
print("\nSingle row\n")
print(df.loc['C'])
print("\nMultiple rows\n")
print(df.loc[['B','C']])
print("\nIndex position based 'iloc' method can be used for selecting row(s)\n",'-'*70, sep='')
print("\nSingle row\n")
print(df.iloc[2])
print("\nMultiple rows\n")
print(df.iloc[[1,2]])
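
`loc` and `iloc` also accept a column selector and slices. The sketch below uses a fixed table (`table`, built with `np.arange` for illustration) so the difference between label slices and position slices is visible.

```python
import numpy as np
import pandas as pd

table = pd.DataFrame(np.arange(20).reshape(5, 4),
                     index=['A', 'B', 'C', 'D', 'E'],
                     columns=['W', 'X', 'Y', 'Z'])

row_by_label = table.loc['C']            # row labelled 'C'
row_by_position = table.iloc[2]          # third row, the same one
cell = table.loc['C', 'Y']               # row and column together
label_slice = table.loc['B':'C', 'X']    # label slices include the end label
position_slice = table.iloc[1:2, 1]      # position slices exclude the end

print(row_by_label.tolist())     # [8, 9, 10, 11]
print(row_by_position.tolist())  # [8, 9, 10, 11]
print(cell)                      # 10
print(label_slice.tolist())      # [5, 9]
print(position_slice.tolist())   # [5]
```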

## Conditional subsetting

In [None]:
df4['Height']>155

In [None]:
df4[df4['Height']>155]

Which students have a **height more than 155 cm and weigh less than 140 lbs**?

In [None]:
df4[(df4['Height']>155) & (df4['Weight']<140)]
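
Conditions are combined with `&` (and), `|` (or), and `~` (not). Each comparison needs its own parentheses because these operators bind more tightly than `>` and `<`. A sketch on a made-up table (`students` and its values are assumptions):

```python
import pandas as pd

students = pd.DataFrame({
    'Name': ['Ann', 'Bob', 'Cy', 'Dee'],
    'Height': [150, 160, 170, 158],
    'Weight': [120, 150, 135, 130],
})

tall = students['Height'] > 155    # boolean Series, one value per row
light = students['Weight'] < 140

both = students[tall & light]      # tall AND light
either = students[tall | light]    # tall OR light
not_tall = students[~tall]         # NOT tall

print(both['Name'].tolist())       # ['Cy', 'Dee']
print(either['Name'].tolist())     # ['Ann', 'Bob', 'Cy', 'Dee']
print(not_tall['Name'].tolist())   # ['Ann']
```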

## Operations on specific columns/rows

In [None]:
df3.head()

#### What is the standard deviation of Magnesium and Ash contents for the wine dataset?

In [None]:
df3[['Magnesium','Ash']].std()

#### What is the range of alcohol content in the wine dataset?

In [None]:
range_alcohol=df3['Alcohol'].max()- df3['Alcohol'].min()
print("The range of alcohol content is: ", round(range_alcohol,3))

#### Which wines are in the top 5 percent in terms of Flavanoids?

In [None]:
# Store the 95th-percentile cutoff instead of hard-coding it
flav_95 = np.percentile(df3['Flavanoids'],95)
flav_95

In [None]:
df3[df3['Flavanoids']>=flav_95]

**Show the average alcohol, ash, and magnesium content of the wines that rank in the top 5 percent in terms of flavanoids**

In [None]:
df3[df3['Flavanoids']>=flav_95][['Ash','Alcohol','Magnesium']].mean()
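
The cutoff can also be computed with pandas itself: `Series.quantile(0.95)` gives the same value as `np.percentile(..., 95)` under the default linear interpolation. A sketch on made-up values (`flav` is not the real Flavanoids column):

```python
import numpy as np
import pandas as pd

flav = pd.Series(np.arange(1, 101) / 10)   # 0.1, 0.2, ..., 10.0 (made-up values)

cutoff_np = np.percentile(flav, 95)
cutoff_pd = flav.quantile(0.95)
top = flav[flav >= cutoff_pd]              # rows at or above the cutoff

print(np.isclose(cutoff_np, cutoff_pd))    # True
print(len(top))                            # 5
```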

## Create a new column as a function of mathematical operations on existing columns

In [None]:
df4

In [None]:
df4['BMI']=df4['Weight']*0.453592/(df4['Height']/100)**2
df4
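
The BMI expression above converts units before dividing: weight from pounds to kilograms (factor 0.453592) and height from centimetres to metres. A worked example for a single made-up person:

```python
weight_lb, height_cm = 150, 170     # made-up person

weight_kg = weight_lb * 0.453592    # pounds to kilograms
height_m = height_cm / 100          # centimetres to metres
bmi = weight_kg / height_m ** 2     # BMI = kg / m^2

print(round(bmi, 2))                # 23.54
```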

In [None]:
df4.sort_values(by='BMI')

## Use `inplace=True` to apply the changes to the original DataFrame

In [None]:
df4

In [None]:
df4.sort_values(by='BMI',inplace=True)

In [None]:
df4
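
To make the difference concrete: without `inplace=True`, `sort_values` returns a new sorted DataFrame and leaves the original untouched. With it, the original is reordered and the call returns `None`. A sketch on a made-up table (`bmi_table` is illustrative):

```python
import pandas as pd

bmi_table = pd.DataFrame({'Name': ['Ann', 'Bob', 'Cy'],
                          'BMI': [24.1, 19.8, 22.5]})

sorted_copy = bmi_table.sort_values(by='BMI')   # returns a new DataFrame
order_before = bmi_table['Name'].tolist()       # original order is kept
print(order_before)                             # ['Ann', 'Bob', 'Cy']
print(sorted_copy['Name'].tolist())             # ['Bob', 'Cy', 'Ann']

result = bmi_table.sort_values(by='BMI', inplace=True)  # sorts bmi_table itself
order_after = bmi_table['Name'].tolist()
print(result)                                   # None
print(order_after)                              # ['Bob', 'Cy', 'Ann']
```

Assigning the result back (`df = df.sort_values(...)`) has the same effect and is often easier to read in longer pipelines.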