# Basics

In [3]:
import numpy as np

# Data science: a multidisciplinary field that uses scientific methods, processes,
# algorithms, and systems to extract knowledge and insights from structured and
# unstructured data. It combines statistics, computer science, and domain-specific
# knowledge, and is closely related to data mining, machine learning, and big data.
# It is applied in business, healthcare, finance, and many other domains.

#-------------------------------------------------------------------------------------------------------
# Data Science Process: The data science process is a series of steps that data scientists follow to extract
# insights from data. It typically involves the following steps:
#1. Define the problem: The first step in the data science process is to define the problem that you want to solve.
#2. Collect data: The second step is to collect the data that you will use to solve the problem.
#3. Clean and preprocess the data: The third step is to clean and preprocess the data so that it is ready for analysis.
#4. Analyze the data: The fourth step is to analyze the data to identify patterns and relationships.
#5. Model the data: The fifth step is to model the data using statistical or machine learning techniques.
#6. Evaluate the model: The sixth step is to evaluate the model to see how well it performs.
#7. Communicate the results: The final step is to communicate the results of the analysis to stakeholders.

#-------------------------------------------------------------------------------------------------------
# Outliers- An outlier is a data point that differs significantly from other observations.
# An outlier may be due to variability in the measurement, or it may indicate experimental error.
#-------------------------------------------------------------------------------------------------------
# Data cleaning- Data cleaning is the process of identifying and correcting errors in a dataset.
# Data cleaning is an important step in the data analysis process, as it ensures that the data is accurate and reliable.
# Data cleaning may involve removing duplicates, correcting errors, and handling missing values.
#-------------------------------------------------------------------------------------------------------
# Feature engineering- Feature engineering is the process of using domain knowledge to extract features from raw data.
# These features can be used to improve the performance of machine learning algorithms.
# Feature engineering is an important part of the machine learning pipeline, as it can have a significant impact on the performance of the model.

#-------------------------------------------------------------------------------------------------------
#Model Building- Model building is the process of creating a mathematical representation of a real-world process.
#In data science, model building involves using machine learning algorithms to create a model that can make predictions based on data.
#Model building is an important step in the data science process, as it allows us to make predictions and extract insights from data.

#-------------------------------------------------------------------------------------------------------
#Model Evaluation- Model evaluation is the process of assessing the performance of a machine learning model.
#Model evaluation is an important step in the data science process, as it allows us to determine how
#well the model is performing and identify areas for improvement.
#Model evaluation may involve using metrics such as accuracy, precision, and recall to evaluate the performance of the model.

#-------------------------------------------------------------------------------------------------------
#Model Deployment- Model deployment is the process of putting a machine learning model into production.
#Model deployment is an important step in the data science process, as it allows us to use the model to make predictions on new data.
#Model deployment may involve deploying the model to a cloud platform, such as AWS or Google Cloud, or integrating the model into an existing application.

#-------------------------------------------------------------------------------------------------------
#Packages used in data science:
#Pandas: Pandas is a library for data manipulation and analysis in Python.
#NumPy: NumPy is a library for numerical computing in Python.
#Matplotlib: Matplotlib is a library for creating static, animated, and interactive visualizations in Python.
#Scikit-learn: Scikit-learn is a library for classical machine learning in Python (classification, regression, clustering, preprocessing).
#Seaborn: Seaborn is a library built on Matplotlib for creating informative and attractive statistical graphics in Python.
#TensorFlow: TensorFlow is a library for building and training machine learning models, especially neural networks, in Python.
#Keras: Keras is a high-level library for deep learning in Python.
#PyTorch: PyTorch is a library for deep learning in Python based on tensors and automatic differentiation.
#SciPy: SciPy is a library for scientific computing in Python.
#Statsmodels: Statsmodels is a library for estimating and interpreting statistical models in Python.
#NLTK: NLTK is a library for natural language processing in Python.

#Command to install packages all at once (PyTorch is published on PyPI as "torch"):
#pip install pandas numpy matplotlib scikit-learn seaborn tensorflow keras torch scipy statsmodels nltk


#TYPES OF PLOTS :
#1. Line Plot: A line plot is a type of plot that displays data points connected by straight line segments.
#Libraries used: Matplotlib, Seaborn
#2. Bar Plot: A bar plot is a type of plot that displays data using rectangular bars.
#Libraries used: Matplotlib, Seaborn
#3. Scatter Plot: A scatter plot is a type of plot that displays data points as dots.
#Libraries used: Matplotlib, Seaborn
#4. Histogram: A histogram is a type of plot that displays the frequency distribution of a dataset.
#Libraries used: Matplotlib, Seaborn
#5. Box Plot: A box plot is a type of plot that summarizes the distribution of a dataset using its median, quartiles, and whiskers, with outliers drawn as individual points.
#Libraries used: Matplotlib, Seaborn
#6. Violin Plot: A violin plot is a type of plot that displays the distribution of a dataset as a mirrored density curve, often with a box plot drawn inside.
#Libraries used: Matplotlib, Seaborn
#7. Heatmap: A heatmap is a type of plot that displays data as a matrix of colors.
#Libraries used: Matplotlib, Seaborn
#8. Pair Plot: A pair plot is a type of plot that displays pairwise relationships in a dataset.
#Libraries used: Seaborn



NumPy Array Object: A NumPy array is a grid of values, all of the same type, indexed by a tuple of nonnegative integers.
The number of dimensions is the rank of the array; the shape of an array is a tuple of integers giving the size of the array along each dimension.
NumPy arrays are similar to Python lists, but they store elements of a single type in a compact block of memory, which makes numerical operations on them much faster.
An array is created using the array() function from the NumPy library.
Syntax: numpy.array(object, dtype=None, copy=True, order='K', subok=False, ndmin=0)
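As a small illustration of the optional parameters in the syntax above, the sketch below passes `dtype` and `ndmin` explicitly. The variable names are made up for this example.

```python
import numpy as np

# dtype forces the element type; without it NumPy infers one from the input
as_float = np.array([1, 2, 3], dtype=float)

# ndmin adds leading axes until the array has at least that many dimensions
as_2d = np.array([1, 2, 3], ndmin=2)

# mixing ints and floats upcasts every element to one common type
mixed = np.array([1, 2.5, 3])

print(as_float, as_float.dtype)
print(as_2d, as_2d.shape)
print(mixed, mixed.dtype)
```

Because every element shares one type, NumPy converts the whole input to the most general type it finds rather than storing mixed types side by side.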

In [4]:
#  1-dimensional numpy array
arr1 = np.array([1, 2, 3])
print("1-dimensional array: ", arr1)
print("Type of arr1: ", type(arr1))

1-dimensional array:  [1 2 3]
Type of arr1:  <class 'numpy.ndarray'>


In [12]:
#  2-dimensional numpy array
arr2 = np.array([[1, 2, 3], [4, 5, 6]])
print("\n2-dimensional array: \n", arr2)
print("Type of arr2: ", type(arr2))


2-dimensional array: 
 [[1 2 3]
 [4 5 6]]
Type of arr2:  <class 'numpy.ndarray'>


In [7]:
#  3-dimensional numpy array
arr3 = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
print("\n3-dimensional array: \n", arr3)
print("Type of arr3: ", type(arr3))


3-dimensional array: 
 [[[ 1  2  3]
  [ 4  5  6]]

 [[ 7  8  9]
  [10 11 12]]]
Type of arr3:  <class 'numpy.ndarray'>


In [8]:
# Example for each attribute is given below:
import numpy as np
arr = np.array([[1, 2, 3], [4, 2, 5]])
print("No of dimensions: ", arr.ndim)  # returns the number of dimensions (axes) of the array
print("Shape of array: ", arr.shape)  # returns the shape of the array
print("Size of array: ", arr.size)  # returns the total number of elements in the array
print("Array stores the elements of type: ", arr.dtype)  # returns the data type of the elements in the array
print("Item size of array (bytes per element): ", arr.itemsize)  # returns the size in bytes of each element in the array
print("Data of array: ", arr.data)  # returns a buffer object pointing to the start of the array's data

No of dimensions:  2
Shape of array:  (2, 3)
Size of array:  6
Array stores the elements of type:  int64
Item size of array (bytes per element):  8
Data of array:  <memory at 0x000001CAFA8E41E0>
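The item size depends on the element type, and the default integer type can differ between platforms and NumPy versions. The sketch below sets `dtype` explicitly to show how `itemsize`, `size`, and `nbytes` relate; the names `small` and `large` are illustrative.

```python
import numpy as np

values = [[1, 2, 3], [4, 2, 5]]

small = np.array(values, dtype=np.int32)    # 4 bytes per element
large = np.array(values, dtype=np.float64)  # 8 bytes per element

for arr in (small, large):
    # nbytes is the total data size: number of elements times bytes per element
    print(arr.dtype, arr.itemsize, arr.size, arr.nbytes)
```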


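# Reshape Function
### reshape() gives an array a new shape (for example, a different number of rows and columns) without changing its data. The new shape must hold the same total number of elements as the original.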


In [9]:
arr = np.array([[1, 2, 3], [4, 2 ,5]])
print("Original array: \n", arr)
print("Array after reshaping: \n", arr.reshape(2, 3))

Original array: 
 [[1 2 3]
 [4 2 5]]
Array after reshaping: 
 [[1 2 3]
 [4 2 5]]
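The call above reshapes a (2, 3) array to (2, 3), so the result looks identical to the input. A minimal sketch with shapes that actually change, using the same data, might look like this (the variable names are illustrative):

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 2, 5]])  # shape (2, 3), 6 elements

tall = arr.reshape(3, 2)    # 3 rows, 2 columns
auto = arr.reshape(-1, 2)   # -1 asks NumPy to infer that dimension
flat = arr.reshape(-1)      # 1-D array with all 6 elements

print(tall)
print(auto.shape, flat.shape)

# A shape whose total size differs from the original raises ValueError
reshape_failed = False
try:
    arr.reshape(4, 2)       # 8 slots for 6 elements
except ValueError as err:
    reshape_failed = True
    print("Cannot reshape:", err)
```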


# Flatten Function

In [10]:
arr = np.array([[1, 2, 3], [4, 2 ,5]])
print("Original array: \n", arr)
print("Array after flattening: ", arr.flatten())


Original array: 
 [[1 2 3]
 [4 2 5]]
Array after flattening:  [1 2 3 4 2 5]
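`flatten()` always returns a copy of the data. As a hedged comparison, the sketch below contrasts it with `ravel()`, which returns a view of the original data when the memory layout allows it (as it does for the freshly created array here).

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 2, 5]])

flat_copy = arr.flatten()  # always a new, independent array
flat_view = arr.ravel()    # a view here, because arr is contiguous in memory

flat_copy[0] = 100         # does not touch arr
flat_view[1] = 200         # writes through to arr

print(arr)
print(np.shares_memory(arr, flat_copy), np.shares_memory(arr, flat_view))
```

Use `flatten()` when the result must be safe to modify, and `ravel()` when avoiding a copy matters.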


# Transpose of an array

In [11]:
#Transpose of an array (np.transpose also accepts a nested list and returns an ndarray)
arr = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print("Original array: \n", arr)
print("Transpose of array: \n", np.transpose(arr))

Original array: 
 [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
Transpose of array: 
 [[1 4 7]
 [2 5 8]
 [3 6 9]]
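For a non-square array the effect on the shape is easier to see. The sketch below uses the `.T` attribute, which is shorthand for `np.transpose` on an ndarray; the array values are chosen only for illustration.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)

transposed = arr.T                       # rows become columns

print(transposed.shape)
print(np.array_equal(transposed, np.transpose(arr)))

# element (i, j) of the original sits at (j, i) in the transpose
print(arr[0, 2] == transposed[2, 0])
```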


# Arithmetic, Statistical and String operations

## 1. Arithmetic Operations: Addition, Subtraction, Multiplication, Division

In [13]:
#1.1 Addition of two arrays
arr1 = np.array([[1, 2, 3], [4, 5, 6]])
arr2 = np.array([[7, 8, 9], [10, 11, 12]])
print("Array1: \n", arr1)
print("Array2: \n", arr2)
print("Sum of two arrays: \n", np.add(arr1, arr2))


Array1: 
 [[1 2 3]
 [4 5 6]]
Array2: 
 [[ 7  8  9]
 [10 11 12]]
Sum of two arrays: 
 [[ 8 10 12]
 [14 16 18]]


In [None]:
#1.2 Subtraction of two arrays
arr1 = np.array([[1, 2, 3], [4, 5, 6]])
arr2 = np.array([[7, 8, 9], [10, 11, 12]])
print("Array1: \n", arr1)
print("Array2: \n", arr2)
print("Subtraction of two arrays: \n", np.subtract(arr1, arr2))

## Multiplication of Two Arrays

In [None]:
#1.3 Multiplication of two arrays
arr1 = np.array([[1, 2, 3], [4, 5, 6]]) # Define the first array
arr2 = np.array([[7, 8, 9], [10, 11, 12]]) # Define the second array
print("Array1: \n", arr1)
print("Array2: \n", arr2)
print("Multiplication of two arrays: \n", np.multiply(arr1, arr2))

## Division of Two Arrays

In [None]:
#1.4 Division of two arrays
arr1 = np.array([[1, 2, 3], [4, 5, 6]]) # Define the first array
arr2 = np.array([[7, 8, 9], [10, 11, 12]]) # Define the second array
print("Array1: \n", arr1)
print("Array2: \n", arr2)
print("Division of two arrays: \n", np.divide(arr1, arr2))

## Power of Two Arrays

In [None]:
#1.5 Power of two arrays
arr1 = np.array([[1, 2, 3], [4, 5, 6]]) # Define the first array
arr2 = np.array([[2, 3, 4], [3, 1, 3]]) # Define the second array
print("Array1: \n", arr1)
print("Array2: \n", arr2)
print("Power of  array: \n", np.power(arr1, arr2))
#Here, the power of the first array is calculated with the elements of the second array.
# The result is a new array where each element is the power of the corresponding element in the first
# array to the element in the second array.
#Output:
#Array1:
#[[ 1  2  3]
# [ 4  5  6]]
#Array2:
#[[ 2  3  4]
# [ 3  1  3]]
#Power of two arrays:
#[[  1   8  81]
# [ 64  25 216]]

Array1: 
 [[1 2 3]
 [4 5 6]]
Array2: 
 [[2 3 4]
 [3 1 3]]
Power of two arrays: 
 [[  1   8  81]
 [ 64   5 216]]


## Calculating Mean, Mode, Median, Standard Deviation and Variance of the elements in an array

In [None]:
#Mean of an array
array = np.array([[1, 2, 3], [4, 5, 6]])
print("Original array: \n", array)
print("Mean of array: ", np.mean(array))

In [None]:
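#Mode of an array (NumPy has no built-in mode function, so count the unique values)
array = np.array([[1, 2, 3], [4, 5, 6], [1, 2, 2]])
print("Original array: \n", array)
values, counts = np.unique(array, return_counts=True)  # unique values and how often each occurs
print("Mode of array: ", values[np.argmax(counts)])  # value with the highest count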

In [None]:
#Median of an array
array = np.array([[1, 2, 3], [4, 5, 6]])
print("Original array: \n", array)
print("Median of array: ", np.median(array))

In [None]:
#Standard deviation of an array
array = np.array([[1, 2, 3], [4, 5, 6]])
print("Original array: \n", array)
print("Standard deviation of array: ", np.std(array))

In [None]:
#Variance of an array
arr = np.array([[1, 2, 3], [4, 5, 6]])
print("Original array: \n", arr)
print("Variance of array: ", np.var(arr))

### Why do we calculate standard deviation and variance in Data Mining?

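### Understanding data spread
The spread of data is a measure of how far the data points lie from the mean. Variance and standard deviation quantify this spread directly.
### Feature selection
Feature selection is the process of selecting a subset of the most relevant features from the original set of features. Features with near-zero variance carry little information and are common candidates for removal.
### Outlier Detection
Outlier detection is the process of identifying data points that are significantly different from the rest of the data. A common rule of thumb flags points that lie more than a few standard deviations from the mean.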
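One way to use the standard deviation for outlier detection is the z-score, which measures how many standard deviations a point lies from the mean. The sketch below is a minimal illustration: the sample values are made up, and the cutoff of 2.0 is an assumed threshold, not a fixed rule.

```python
import numpy as np

# Made-up sample with one obviously extreme value
data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 11, 50])

mean = data.mean()
std = data.std()

# z-score: distance from the mean in units of standard deviation
z_scores = (data - mean) / std

threshold = 2.0  # assumed cutoff; choose it to suit the data
outliers = data[np.abs(z_scores) > threshold]

print("Mean:", mean, "Std:", std)
print("Outliers:", outliers)
```

The z-scores themselves have mean 0 and standard deviation 1, which is why the same transformation is also used to put features on a common scale.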
### Normalization and scaling
Normalization and scaling are techniques used to transform the data into a common scale, making it easier to compar