# NumPy

NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

### Resources
 * https://numpy.org/learn/ Tutorial
 * https://numpy.org/doc/stable/reference/generated/numpy.set_printoptions.html print options


## Basics

In [None]:
import numpy as np 

In [None]:
## Converting list to numpy array
lis = [1, 2, 3, 4]
lis2 = np.array(lis)

print(type(lis))
print(type(lis2))

In [None]:
## The arange function
np.arange(0, 3, 0.4)      ## np.arange(start, stop, step)

In [None]:
## Example of homogeneity: mixed inputs are coerced to one common dtype

a = np.array([0,'a',1.5])
a.dtype.name
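
The coercion above can be made more visible with a small sketch. The arrays below are hypothetical inputs chosen only to show how NumPy picks a single dtype for every element.

```python
import numpy as np

# Hypothetical inputs, chosen only to show NumPy's dtype coercion.
ints_and_floats = np.array([1, 2, 3.5])   # int mixed with float -> float64
with_string = np.array([0, 'a', 1.5])     # any string forces a Unicode string dtype

print(ints_and_floats.dtype)    # float64
print(with_string.dtype.kind)   # 'U' means Unicode string
print(with_string)              # every element is now stored as a string
```

Because the second array holds strings, arithmetic such as `with_string + 1` would no longer work on it.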

In [None]:
my_matrix = [[1,2,3],[4,5,6],[7,8,9]]
my_matrix

In [None]:
np.array(my_matrix)

### Zeros and Ones

In [None]:
np.zeros(3)

In [None]:
np.zeros((3,3))

In [None]:
print(np.ones(3))
print('\n',np.ones((3,3)))

In [None]:
## Identity matrix

np.eye(4)

### Linspace

In [None]:
np.linspace(0, 10, 3)   ### np.linspace(start, stop, number of data points)

In [None]:
np.linspace(-1,1,20)

### Random

#### rand

Creates an array of the given shape and populates it with random samples from a uniform distribution over [0, 1)

In [None]:
print(np.random.rand(2))
print('\n',np.random.rand(2,2))

#### randint
Returns random integers from `low` (inclusive) to `high` (exclusive).  [[reference](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.randint.html)]

In [None]:
print(np.random.randint(1,25))

np.random.randint(1,100,10)

#### seed
Can be used to set the random state, so that the same "random" results can be reproduced. [[reference](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.seed.html)]

In [None]:
np.random.seed(100)
np.random.rand(4)

In [None]:
np.random.seed(100)
np.random.rand(4)

In [None]:
np.random.seed(10)
np.random.rand(4)

In [None]:
np.random.seed(10)
np.random.rand(4)
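
The same reproducibility idea can also be sketched with NumPy's newer `Generator` interface, created through `np.random.default_rng`. This is an alternative to the `np.random.seed` calls above, not a replacement the notebook depends on; the seed value 100 is reused only for illustration.

```python
import numpy as np

# Two generators created with the same seed produce the same stream.
rng_a = np.random.default_rng(100)
rng_b = np.random.default_rng(100)

sample_a = rng_a.random(4)   # uniform samples over [0, 1)
sample_b = rng_b.random(4)
print(np.array_equal(sample_a, sample_b))   # True

# integers() excludes the upper bound by default, like randint.
print(rng_a.integers(1, 100, size=10))
```

A generator object keeps its own state, so seeding one does not affect random numbers drawn elsewhere in the notebook.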

### Reshape

In [None]:
a = np.arange(16)
print(a,'\n')
print(a.reshape(4,4),'\n')
print(a.reshape(2,8))

### max, min, argmax, argmin

These methods find the maximum and minimum values of an array; argmax and argmin return the index locations of those values.

In [None]:
arr = np.random.randint(0,50,10)

print(arr,'\n')
print('max: ', arr.max(),'\n')
print('min: ', arr.min(),'\n')
print('argmax: ', arr.argmax(),'\n')
print('argmin: ', arr.argmin(),'\n')

### Indexing

In [None]:
a = np.random.randint(10,39,10)

print(a)
print(a[5])
print(a[1:5])

In [None]:
#Setting a value
a[5] = 9
a

In [None]:
a = np.array(([1, 2, 3],[4, 5, 6],[7, 8, 9]))

a[2, 1]

In [None]:
#Shape (2,2)
a[:2,:2]

## Operations

In [None]:
a = np.arange(10)

print('add \t \t: ',a + a)
print('add integer \t: ',a + 8)
print('Subtract \t: ',a - 2)
print('multiply \t: ',a * a)
print('power \t\t: ',a ** 2)

In [None]:
a = np.arange(1, 11)

# Taking Square Roots
print('square root: \n',np.sqrt(a))

# Calculating exponential (e^)
print("\n\nExponential: \n",np.exp(a))

# Trigonometric Functions like sine
print('\n\nTrigonometric function: \n',np.sin(a))

# Taking the Natural Logarithm
print("\n\nNatural Log: \n",np.log(a))

In [None]:
a = np.arange(1, 11)

# Taking Square Roots
print('square root: \n',np.around(np.sqrt(a),3))

# Calculating exponential (e^)
print("\n\nExponential: \n",np.around(np.exp(a),3))

# Trigonometric Functions like sine
print('\n\nTrigonometric function: \n',np.around(np.sin(a),3))

# Taking the Natural Logarithm
print("\n\nNatural Log: \n",np.around(np.log(a),3))

# Basic plotting

### Reference: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.html

In [None]:
import matplotlib.pyplot as plt
import numpy as np
x = np.random.randint(0,10,10)
y = np.arange(10)
plt.plot(x,y)
plt.show()

In [None]:
plt.scatter(x,y)
plt.show()

In [None]:
plt.scatter(x,y, c = np.array([1,0,0,0,1,1,0,1,1,0]))
plt.show()

In [None]:
plt.bar(x,y)
plt.show()

In [None]:
x = np.arange(50)
y = np.sin(x)
plt.plot(x,y)
plt.show()

# SciPy


### Resources
 * https://www.tutorialspoint.com/scipy/index.htm
 * https://docs.scipy.org/doc/scipy/reference/tutorial/general.html

SciPy is a collection of mathematical algorithms and convenience functions built on the NumPy extension of Python.


* NumPy stands for Numerical Python while SciPy stands for Scientific Python. 
* SciPy provides tools for integration, differentiation, optimization, and much more.
* Subpackages

    * cluster: Clustering algorithms

    * constants: Physical and mathematical constants

    * fftpack: Fast Fourier Transform routines

    * integrate: Integration and ordinary differential equation solvers

    * interpolate: Interpolation and smoothing splines

    * io: Input and Output

    * linalg: Linear algebra

    * ndimage: N-dimensional image processing

    * odr: Orthogonal distance regression

    * optimize: Optimization and root-finding routines

    * signal: Signal processing

    * sparse: Sparse matrices and associated routines

    * spatial: Spatial data structures and algorithms

    * special: Special functions

    * stats: Statistical distributions and functions
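
As a small taste of one subpackage from the list above, here is a minimal sketch using `integrate.quad` to integrate sin(x) from 0 to pi. The integrand and limits are chosen only because the exact answer (2) is easy to check.

```python
import numpy as np
from scipy import integrate

# Integrate sin(x) from 0 to pi; the exact answer is 2.
value, abs_error = integrate.quad(np.sin, 0, np.pi)

print(value)       # numerical estimate of the integral
print(abs_error)   # estimate of the absolute error
```

`quad` returns a pair, so the estimate and its error bound are unpacked into two names.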

In [None]:
#constants
#Import only the names used below instead of a wildcard import
from scipy.constants import pi, c, milli, degree

print("SciPy - pi = %.2f"%pi)
print("\nSpeed of light is: ",c,"m/s")
print("\nOne milligram is",milli,"grams")
print("\n1 degree is {:.5f} radians".format(degree))

### linalg

In [None]:
import numpy as np
from scipy import linalg

A = np.array([[1, 2], [3, 4]])  # np.mat was removed in NumPy 2.0; a plain array works with linalg

print(linalg.inv(A))

print(A.dot(linalg.inv(A)) )

In [None]:
## Solving equations
A = np.array([[1, 2], [3, 4]])
print(A)
b = np.array([[5], [11]])
print(b)

ans = linalg.inv(A).dot(b)  
print(ans)
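
The cell above solves the system by multiplying with the inverse. A common alternative, sketched here on the same `A` and `b`, is `linalg.solve`, which solves the system directly without forming the inverse.

```python
import numpy as np
from scipy import linalg

# Same system as above: x + 2y = 5 and 3x + 4y = 11
A = np.array([[1, 2], [3, 4]])
b = np.array([[5], [11]])

x = linalg.solve(A, b)         # solves A @ x = b without computing inv(A)
print(x)
print(np.allclose(A @ x, b))   # True when the solution satisfies the system
```

For larger or poorly conditioned matrices, solving directly is generally preferred over computing an explicit inverse.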

# Pandas

Pandas is a Python library widely used in machine learning workflows for data cleaning and analysis. It provides features for exploring, cleaning, transforming, and visualizing data.

You can think of pandas as an extremely powerful version of Excel, with a lot more features.

In [7]:
import pandas as pd
import numpy as np
from numpy.random import randn
np.random.seed(100)

### DataFrames

We can think of a DataFrame as a bunch of Series objects put together to share the same index.
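
To make that picture concrete, here is a minimal sketch that builds a DataFrame from two Series. The Series names and values are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical Series; the names and values are illustrative only.
w = pd.Series([1.0, 2.0, 3.0], index=['A', 'B', 'C'])
x = pd.Series([10.0, 20.0, 30.0], index=['A', 'B', 'C'])

frame = pd.DataFrame({'W': w, 'X': x})   # columns align on the shared index
print(frame)
print(type(frame['W']))                  # each column is a Series
print(frame['W'].index.equals(frame.index))
```

Selecting a single column gives back a Series that carries the same index as the DataFrame.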

In [6]:
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())
df

          W         X         Y         Z
A  1.618982  1.541605 -0.251879 -0.842436
B  0.184519  0.937082  0.731000  1.361556
C -0.326238  0.055676  0.222400 -1.443217
D -0.756352  0.816454  0.750445 -0.455947
E  1.189622 -1.690617 -1.356399 -1.232435


### Indexing

In [3]:
df['X']

A    0.342680
B    0.514219
C    0.255001
D    0.816847
E    1.029733
Name: X, dtype: float64

In [4]:
df[['W','Y']]

          W         Y
A -1.749765  1.153036
B  0.981321  0.221180
C -0.189496 -0.458027
D -0.583595  0.672721
E -0.531280 -0.438136


In [9]:
df['X']['A']

1.5416051745134067

In [5]:
df.Y

A    1.153036
B    0.221180
C   -0.458027
D    0.672721
E   -0.438136
Name: Y, dtype: float64

### Creating a new column:

In [None]:
df['new'] = df['W'] + df['Y']

In [None]:
df

### Removing Columns

In [None]:
df.drop('new',axis=1)

In [None]:
# Not inplace unless specified!
df

In [None]:
df.drop('new',axis=1,inplace=True)

In [None]:
df

Can also drop rows this way:

In [None]:
df.drop('E',axis=0)

### Selecting Rows

In [None]:
df.loc['A']

Or select by position instead of label:

In [None]:
df.iloc[2]

### Selecting subset of rows and columns

In [None]:
df.loc['B','Y']

In [None]:
df.loc[['A','B'],['W','Y']]

### Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to numpy:

In [None]:
df

In [None]:
df>0

In [None]:
df[df>0]

In [None]:
df[df['W']>0]

In [10]:
df[df['W']>0]['Y']

A   -0.251879
B    0.731000
E   -1.356399
Name: Y, dtype: float64

In [None]:
df[df['W']>0][['Y','X']]

For two conditions you can use | and & with parentheses:

In [None]:
df[(df['W']>0) & (df['Y'] > 1)]
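
The cell above uses `&`. Here is a short sketch of `|` (or) and `~` (not) as well, on a hypothetical frame named `demo`. The parentheses matter because `&` and `|` bind more tightly than comparisons, and Python's `and`/`or` keywords raise an error on a Series because its truth value is ambiguous.

```python
import pandas as pd

# Hypothetical frame, used only to illustrate combining boolean masks.
demo = pd.DataFrame({'W': [1.0, -2.0, 3.0, -4.0],
                     'Y': [0.5, 2.0, -1.0, -3.0]},
                    index=['A', 'B', 'C', 'D'])

either = demo[(demo['W'] > 0) | (demo['Y'] > 1)]       # at least one condition holds
both = demo[(demo['W'] > 0) & (demo['Y'] > 0)]         # both conditions hold
neither = demo[~((demo['W'] > 0) | (demo['Y'] > 1))]   # ~ negates a mask

print(list(either.index))    # ['A', 'B', 'C']
print(list(both.index))      # ['A']
print(list(neither.index))   # ['D']
```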

### Operations

There are lots of operations with pandas that will be really useful to you, but don't fall into any distinct category. Let's show them here in this lecture:

In [12]:
import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
df.head()

   col1  col2 col3
0     1   444  abc
1     2   555  def
2     3   666  ghi
3     4   444  xyz


### Info on Unique Values

In [None]:
df['col2'].unique()

In [18]:
df['col2'].nunique()

3

In [20]:
df['col2'].value_counts()


444    2
555    1
666    1
Name: col2, dtype: int64

### Selecting Data

In [21]:
#Select from DataFrame using criteria from multiple columns
newdf = df[(df['col1']>2) & (df['col2']==444)]

In [22]:
newdf

   col1  col2 col3
3     4   444  xyz


### Applying Functions

In [None]:
def times2(x):
    return x*2

In [None]:
df['col1'].apply(times2)

In [None]:
df['col3'].apply(len)

In [None]:
df['col1'].sum()

### Permanently Removing a Column

In [None]:
del df['col1']

In [None]:
df

### Get column and index names:

In [None]:
df.columns

In [None]:
df.index

### Sorting and Ordering a DataFrame:

In [None]:
df

In [None]:
df.sort_values(by='col2') #inplace=False by default

# Data Pre-processing:
Data preprocessing is a set of steps that make data cleaner, more relevant, and suitable for machine learning and further data mining.
### Need of Data preprocessing:
Real-world data is often dirty because data collection methods are loosely controlled. This results in:

 * Incomplete data: important values are missing.
 * Noisy data: the data contains outliers or invalid entries, e.g. a negative value for age.
 * Inconsistent data: the data contains impossible combinations, such as year of birth 1990 and age 50.
 * Duplicate data: repeated records can produce misleading statistics and hence affect predictions.

Throughout this notebook, we will learn some data pre-processing methods.
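
The four kinds of problems listed above can each be detected with a short pandas check. The sketch below uses a made-up `people` table and an assumed `survey_year` of 2020; neither comes from the house dataset used later.

```python
import numpy as np
import pandas as pd

# Hypothetical records; column names and values are made up for illustration.
people = pd.DataFrame({
    'birth_year': [1990, 1985, 2000, 1995, 1995],
    'age':        [50, np.nan, -3, 25, 25],
    'city':       ['Pune', 'Delhi', 'Pune', 'Mumbai', 'Mumbai'],
})
survey_year = 2020   # assumed reference year for the consistency check

missing_per_column = people.isna().sum()              # incomplete data
negative_age = people[people['age'] < 0]              # noisy data
expected_age = survey_year - people['birth_year']
inconsistent = people[(people['age'] - expected_age).abs() > 1]   # inconsistent data
duplicates = people[people.duplicated()]              # duplicate data

print(missing_per_column)
print(list(negative_age.index))    # [2]
print(list(inconsistent.index))    # [0, 2]
print(list(duplicates.index))      # [4]
```

Row 2 is flagged twice because a negative age is both invalid on its own and inconsistent with the birth year.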

### 1. Loading the Dataset
We normally load datasets with pandas.
Since the data here is a .csv file, we load it with the pd.read_csv command.

The dataset used here is a house dataset with columns for city, number of bedrooms, whether the house is near a market area, price, and whether it was purchased.
The column named purchased is the dependent column.

In [None]:
#Reading data using pandas
data = pd.read_csv('House.csv')

In [2]:
data


In [None]:
#Delete the extra index column
del data['Unnamed: 0']
data.head(3)

### Splitting the dataset
The data is generally split into independent and dependent variables before preprocessing.
Here the independent variables are the ones that determine the output, which is whether the house is purchased or not.

In [None]:
#Input (independent) features
#The dependent variable is in the last column, so we select every column except the last as input
#Note: in IPython/Jupyter, the names In and Out shadow the built-in input/output history variables
In = data.iloc[:, :-1].values
#Output (dependent) variable
#The last column holds the output
Out = data.iloc[:, -1].values

In [None]:
In[:5]    #first 5 rows of the input data

In [None]:
Out

### Missing Values:

When values are missing, discarding the affected records is often a reasonable first step, but it can lose a large amount of data or some important data. To avoid that, we use missing value imputation, where missing values are replaced with values such as 0, 1, or the mean/median of the column. In most cases missing values are coded as NaN, NA, or blank, and libraries such as scikit-learn provide tools to handle them.

In our data we can see missing values in rows 1, 4, 11, 19, and so on.
Here, in place of the missing data, we are going to add the average value of the data that is present.
### Scikitlearn:
Here we use the scikit-learn library, imported as sklearn. It is a widely used and robust library for machine learning in Python, providing a selection of efficient tools for machine learning and statistical modeling.

The class we use from sklearn here is SimpleImputer, which imputes missing data.
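
Before using the imputer, it may help to see the arithmetic behind mean imputation. The sketch below does it by hand with NumPy on a hypothetical array named `values`; it illustrates the idea and is not SimpleImputer's internal code.

```python
import numpy as np

# Hypothetical numeric columns with gaps (np.nan marks a missing value).
values = np.array([[2.0, 100.0],
                   [np.nan, 200.0],
                   [4.0, np.nan],
                   [6.0, 300.0]])

column_means = np.nanmean(values, axis=0)   # column means, ignoring NaN
filled = np.where(np.isnan(values), column_means, values)

print(column_means)   # [  4. 200.]
print(filled)         # each NaN is replaced by the mean of its column
```

Each missing entry takes the mean of its own column, so the column means are unchanged by the imputation.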

In [None]:
from sklearn.impute import SimpleImputer
#Create an imputer object
#missing_values=np.nan tells the imputer which entries count as missing
#strategy='mean' replaces each missing value with the mean of its column
SI = SimpleImputer(missing_values=np.nan, strategy='mean')

In [None]:
#fit computes the mean of each selected column; it does not replace anything yet
#fit accepts only the columns of the input data that hold numerical values
#Missing entries are ignored when the means are computed
SI.fit(In[:, 1:3])
#transform returns an updated copy with the missing values replaced by those means
In[:, 1:3] = SI.transform(In[:, 1:3])

In [None]:
#Show the first ten rows of the independent data
In[:10] #the missing values are now replaced by the column means

### Managing categorical data:
When string columns hold categories, an ML model cannot compute relationships with these columns directly, so we need to encode such data as numerical values. We could enumerate the categories, assigning one number to each value, but this may mislead the model by implying an order between categories, so we use one-hot encoding here. It converts categorical variables into a form that ML algorithms can use to do a better job in prediction: n categories become n columns, with a binary vector for each category.
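
The difference between enumerating categories and one-hot encoding them can be sketched by hand with NumPy. The `cities` array below is hypothetical, and this is only an illustration of the idea, not how OneHotEncoder is implemented.

```python
import numpy as np

# Hypothetical city column; the names are illustrative only.
cities = np.array(['Pune', 'Delhi', 'Mumbai', 'Delhi'])

# categories comes back sorted: ['Delhi' 'Mumbai' 'Pune']; codes index into it
categories, codes = np.unique(cities, return_inverse=True)
one_hot = np.eye(len(categories), dtype=int)[codes]

print(categories)
print(codes)      # [2 0 1 0]: integer labels, which imply a false ordering
print(one_hot)    # one binary column per category, a single 1 in each row
```

With integer labels a model could read Pune (2) as "greater than" Delhi (0); the binary columns carry no such order.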

In [None]:
#Independent variables
#OneHotEncoder converts the categorical data in column 0 into binary columns
#ColumnTransformer applies the encoder to the chosen column and passes the remaining columns through unchanged
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
transformer = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
In = np.array(transformer.fit_transform(In))
#LabelEncoder converts values such as yes/no or true/false into 0 and 1
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
In[:, -1] = le.fit_transform(In[:, -1])
#Dependent variable
Out = le.fit_transform(Out)

In [None]:
#Show the first ten rows of the independent data
In[:10]

In [None]:
Out

### Splitting the dataset into the Training set and Test set

Before it is applied to a machine learning model, the data is split into two sets: a training set and a test set. The training set is used to fit the model, while the test set is held back for testing. A common practice is to reserve 20% of the data as the test set. This split must be performed before feature scaling. For splitting data, scikit-learn provides the train_test_split function in its model_selection subpackage.
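
To show what a random 80/20 split does, here is a minimal sketch that uses a NumPy permutation instead of train_test_split. The `features` and `target` arrays are hypothetical, and scikit-learn's function will not pick the same rows.

```python
import numpy as np

# Hypothetical data: 10 samples, 2 features, binary target.
features = np.arange(20).reshape(10, 2)
target = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

rng = np.random.default_rng(1)             # fixed seed for a repeatable shuffle
order = rng.permutation(len(features))     # shuffled row indices
n_test = int(round(0.2 * len(features)))   # 20% of the rows

test_idx, train_idx = order[:n_test], order[n_test:]
features_train, features_test = features[train_idx], features[test_idx]
target_train, target_test = target[train_idx], target[test_idx]

print(len(features_train), len(features_test))   # 8 2
```

Every row lands in exactly one of the two sets, and the same index arrays are applied to the features and the target so the pairs stay aligned.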

In [None]:
#train_test_split will split data into the given partition percentage
#The test dataset will be chosen randomly
from sklearn.model_selection import train_test_split
In_train, In_test, Out_train, Out_test = train_test_split(In, Out, test_size = 0.2, random_state = 1)   #0.2 to reserve 20% data as test data

In [None]:
print('Length of independent training set :', len(In_train))
print('Length of dependent training set   :', len(Out_train))
print('Length of independent test set :', len(In_test))
print('Length of dependent test set   :', len(Out_test))

### Feature Scaling:
Feature scaling puts all features on the same scale so that features with large values do not dominate the others. It is not needed for every machine learning model. The two main feature scaling techniques are normalization and standardization. Measurement units can affect the analysis: for example, changing the unit from meters to inches can change the result to a large extent. To remove this dependence on the choice of unit, we normalize or standardize the data.

#### Standardization:
Standardization rescales a column to have mean 0 and standard deviation 1, so most values typically fall in the range -3 to 3.
The formula for standardization is

    xstand = (x-mean(x))/standard_deviation(x)

#### Normalization:
Normalization rescales a column to the range 0 to 1.
The formula for normalization is

    xnorm = (x-min(x))/(max(x)-min(x))

Normalization is recommended when the feature values follow a normal distribution.
Standardization can be used in every case.
So here we will go with standardization.
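
The two formulas can be checked by hand on a small example. The array `x` below is hypothetical, chosen so the arithmetic is easy to follow; the standard deviation used is NumPy's default population form, which is also what StandardScaler uses.

```python
import numpy as np

# Hypothetical feature values, chosen so the arithmetic is easy to follow.
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

x_stand = (x - x.mean()) / x.std()             # standardization (population std)
x_norm = (x - x.min()) / (x.max() - x.min())   # normalization to [0, 1]

print(x_stand)                        # centered on 0
print(x_stand.mean(), x_stand.std())  # approximately 0 and 1
print(x_norm)                         # [0.   0.25 0.5  0.75 1.  ]
```

Standardized values are not bounded, while normalized values always lie between 0 and 1.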

In [None]:
#Here we standardize the data
#Feature scaling must be done after splitting the dataset into training and test sets
from sklearn.preprocessing import StandardScaler
#Create a StandardScaler object
scaler = StandardScaler()

In [None]:
#We do not need to scale the encoded categorical columns, as they are already in the required range
#The aim of feature scaling is to bring all values into the same range
#The encoded columns already meet that condition
#Scaling them could also produce misleading values
#So we skip the first four columns and the last column of the input data, as well as the output
#That leaves columns 4 and 5 to scale
In_train[:, 4:6] = scaler.fit_transform(In_train[:, 4:6])
#fit computes the mean and standard deviation from the training data
#transform applies them to produce the scaled values; the test set reuses the training statistics
In_test[:, 4:6] = scaler.transform(In_test[:, 4:6])

In [None]:
#Show the first ten rows of the independent data
In_train[:10]

In [None]:
In_test

This makes the dataset ready to be used with machine learning models.