# Standard Process
1. Collecting Data
</br></t>Gather data from several sources
2. Preprocessing Data
</br></t>Filter, clean, and transform data into required format
3. Analyzing and Finding Insights
</br></t>Explore, describe, and visualize the data for insights and conclusions
4. Interpreting Insights
</br></t>Understand the insights and the impact each variable has
5. Storytelling
</br></t>Communicate results in the form of a story in simple terms

# Data mining (Knowledge Discovery in Databases)
1. Data Cleaning
</br></t>Preprocess data, e.g. remove noise, handle missing values, and detect outliers
2. Data Integration
</br></t>Combine and integrate data from different sources using data migration and ETL tools
3. Data Selection
</br></t>Retrieve the data relevant to the upcoming analysis tasks
4. Data Transformation
</br></t>Engineer data into the required form
5. Data Mining
</br></t>Apply data mining techniques to discover useful and previously unknown patterns
6. Pattern Evaluation
</br></t>Evaluate extracted patterns
7. Knowledge Presentation
</br></t>Visualize and present extracted knowledge for decision-making purposes
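
The cleaning and transformation steps above can be sketched in a few lines. This is only an illustration: the specific techniques (median imputation, the 1.5 x IQR outlier rule, min-max scaling), the function names, and the `raw` values are assumptions chosen for the example, not something the KDD process prescribes.

```python
import statistics

def impute_missing(values):
    """Replace None with the median of the observed values (assumed strategy)."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed outlier rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical raw column with two missing entries and one extreme value
raw = [10, 20, None, 30, 40, 500, 50, None, 60]

cleaned = impute_missing(raw)                           # data cleaning: missing values
outliers = iqr_outliers(cleaned)                        # data cleaning: outlier detection
kept = [v for v in cleaned if v not in outliers]
scaled = min_max_scale(kept)                            # data transformation

print(cleaned)
print(outliers)
print(scaled)
```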

# SEMMA
Sample, Explore, Modify, Model, Assess
1. Sample
</br></t>Identify different databases and merge them. Then, select a data sample
</br></t>that's sufficient for the modeling process
2. Explore 
</br></t>Discover relationships among variables, visualize data, and get initial interpretations
3. Modify
</br></t>Prepare data for modeling, e.g. deal with missing values, detect outliers,
</br></t>transform features, and create new features
4. Model
</br></t>Select and apply different modeling techniques, e.g. linear/logistic regression,
</br></t>backpropagation networks, KNN, support vector machines, decision trees, and random forests
5. Assess
</br></t>Evaluate models with the appropriate evaluation measures

# CRISP-DM
1. Business Understanding
</br></t>Understand the business scenario and requirements for designing an
</br></t>analytical goal and initial action plan
2. Data Understanding
</br></t>Understand the data and its collection process, perform data quality checks,
</br></t>and gain initial insights
3. Data Preparation
</br></t>Prepare data for analytics. Handle missing values, detect and handle outliers,
</br></t>normalize data, and engineer features. Usually the most time-consuming phase
4. Modeling
</br></t>Decide and apply a modeling technique based on the data and objective
5. Evaluation
</br></t>Assess and test the model's performance on validation and test data.
</br></t>Apply the appropriate model evaluation measures, e.g. MSE, RMSE, and
</br></t>R-Square for regression, and accuracy, precision, recall, and F1 for classification
6. Deployment
</br></t>Deploy the chosen model into the production environment. This is a team effort
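
The evaluation measures named in step 5 can be computed directly from their definitions. The sketch below is a minimal illustration; the function names and the example labels/predictions are hypothetical, and the classification helper assumes a binary problem with at least one actual and one predicted positive.

```python
import math

def regression_metrics(y_true, y_pred):
    """MSE, RMSE, and R-Square for numeric predictions."""
    n = len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    mse = ss_res / n
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": 1 - ss_res / ss_tot}

def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical held-out targets and model outputs
reg = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9])
clf = classification_metrics([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0])
print(reg)
print(clf)
```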

# Data Types
* Categorical
</br></t>Nominal
</br></t></t>Names/labels of categorized variables in an unordered fashion
</br></t>Ordinal
</br></t></t>Names/labels of categorized variables in an ordered fashion
* Numerical
</br></t>Discrete
</br></t></t>A countable set of values, e.g. the number of items
</br></t>Continuous
</br></t></t>Uncountably many possible values within a range, e.g. a measurement


# Types of Data
* **Categorical**
<br><t>**Nominal**
<br><t></t><t>Names/labels for values. Unordered in nature. Calculating mode may be useful
<br><t>**Ordinal**
<br><t><t>Names/labels for values. Ordered in nature, but unknown magnitude. Calculating
<br><t><t>mode and median may be useful
* **Numerical**
<br><t>**Discrete**
<br><t></t><t>Values that are countable, e.g. the number of items. Can have an interval/ratio scale.
<br><t>**Continuous**
<br><t></t><t>Values that are not countable; any value within a range is possible, e.g. height or temperature.
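
As a small illustration of why the data type decides which summary is meaningful, the sketch below takes the mode of a nominal variable and the median of an ordinal one. The category names and the order `low < medium < high` are assumptions made up for the example.

```python
import statistics

# Nominal: labels have no order, so the mode is the only sensible "average"
colors = ["red", "blue", "red", "green", "red"]
nominal_mode = statistics.mode(colors)

# Ordinal: labels are ordered, so the median is also meaningful
levels = ["low", "medium", "high"]            # assumed category order
ratings = ["low", "high", "medium", "medium", "high", "low", "medium"]
positions = [levels.index(r) for r in ratings]  # map each label to its rank in the order
ordinal_median = levels[statistics.median_low(positions)]

print(nominal_mode)    # red
print(ordinal_median)  # medium
```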

# Central Tendency
The tendency of values to cluster around a central value such as the mean, median, or mode.
* Mean: Sum of observations/number of observations
* Mode: Most frequently occurring observation
* Median: Mid-point observation in a sorted group of observations. Also known as the 50th percentile
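
A minimal sketch of the three measures using Python's `statistics` module; the `observations` list is a hypothetical sample.

```python
import statistics

# Hypothetical sample of observations
observations = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(observations)      # sum of observations / number of observations
median = statistics.median(observations)  # averages the two middle values when n is even
mode = statistics.mode(observations)      # most frequently occurring value

print(mean, median, mode)
```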

# Dispersion
* Range
<br><t>Difference between the maximum and minimum values of the observations
* IQR (Inter-Quartile-Range)
<br><t>Difference between the third and first quartiles. Contains the middle 50% of the observations.
* Variance 
<br><t>The spread around the mean. The average value of the squared difference between
<br><t>each observation and the mean.
<br><t>$\frac{1}{N}\sum_{i=1}^{N}(x_i - \overline{x})^2$
* Standard deviation
<br><t>Square root of the variance. Easier to interpret since it is in the same units as the observations.
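
The four measures can be checked on a small hypothetical sample. The sketch uses the population forms (`pvariance`, `pstdev`) to match the formula above, and assumes the `"inclusive"` quartile method; other quartile conventions give slightly different IQR values.

```python
import statistics

# Hypothetical sample of observations (mean is 5)
observations = [2, 4, 4, 4, 5, 5, 7, 9]

value_range = max(observations) - min(observations)

# Quartiles; the "inclusive" interpolation method is an assumption for this example
q1, q2, q3 = statistics.quantiles(observations, n=4, method="inclusive")
iqr = q3 - q1

variance = statistics.pvariance(observations)  # mean of squared deviations from the mean
std_dev = statistics.pstdev(observations)      # square root of the variance

print(value_range, iqr, variance, std_dev)
```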

# Skewness
Measures the asymmetry of a distribution.
* Right-Tailed (Positive Skewness): Tail at the right. Typically Mean > Median
* Left-Tailed (Negative Skewness): Tail at the left. Typically Mean < Median
* Kurtosis: Measures tailedness (tail thickness) compared to a normal distribution, whose kurtosis is 3.
<br><t>Excess kurtosis (kurtosis - 3) can be zero, negative (thin-tailed), or positive (fat-tailed).
<br>
* Zero Excess Kurtosis (Kurtosis = 3) -> Mesokurtic
* Negative Excess Kurtosis (Kurtosis < 3) + Thin-Tailed -> Platykurtic
* Positive Excess Kurtosis (Kurtosis > 3) + Fat-Tailed -> Leptokurtic
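
Both quantities can be sketched as standardized moments. The helpers below use the population (biased) moment definitions, which is an assumption; statistical packages often apply small-sample corrections and will report slightly different numbers. The sample lists are hypothetical.

```python
import statistics

def skewness(values):
    """Third standardized moment (population form)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((x - mean) / std) ** 3 for x in values) / len(values)

def excess_kurtosis(values):
    """Fourth standardized moment minus 3, so a normal distribution scores 0."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((x - mean) / std) ** 4 for x in values) / len(values) - 3

right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]  # one large value stretches the right tail
flat = [1, 2, 3, 4, 5]                    # symmetric and thin-tailed

print(skewness(right_tailed))         # positive: right-tailed
print(skewness(flat))                 # zero: symmetric
print(excess_kurtosis(flat))          # negative: platykurtic
print(excess_kurtosis(right_tailed))  # positive: leptokurtic
```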

# Correlation Coefficients
* Pearson
<br><t>The "standard" coefficient. Measures the strength of the linear relationship between two numerical variables.
* Kendall
<br><t>Non-parametric measure of rank correlation. Use when data is skewed or
<br><t>affected by outliers, since it makes no assumptions about the data
<br><t>distribution.
* Spearman
<br><t>Non-parametric measure of the monotonic association between two ranked (ordinal) variables.
<br><t>If both variables are binary, then Pearson=Spearman=Kendall (tau-b).
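
The three coefficients can be written out from their definitions to see how they relate. This is a from-scratch sketch rather than a library API: the function names are hypothetical, Spearman is computed as Pearson on average ranks, and Kendall uses the tau-b variant, which corrects for ties.

```python
import math

def pearson(x, y):
    """Covariance divided by the product of the standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """Rank values from 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks

def spearman(x, y):
    """Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def kendall_tau_b(x, y):
    """(concordant - discordant) pairs, normalized with a correction for ties."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue                      # tied in both variables
            elif dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / math.sqrt((untied + only_x_tied) * (untied + only_y_tied))

# Monotonic but non-linear: rank measures are 1, Pearson is below 1
x, y = [1, 2, 3, 4, 5], [1, 4, 9, 16, 25]
print(pearson(x, y), spearman(x, y), kendall_tau_b(x, y))

# Two binary variables: all three coefficients coincide
a, b = [0, 0, 1, 1, 1, 0], [0, 1, 1, 1, 0, 0]
print(pearson(a, b), spearman(a, b), kendall_tau_b(a, b))
```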

# Central Limit Theorem
The sampling distribution of the sample mean approaches a normal distribution as the sample size increases, even when the population itself is not normal.
The mean of the sample means approaches the population mean and
the standard deviation of the sample means (the standard error) decreases.
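
A quick simulation makes the theorem concrete. The setup is an assumption for illustration: a skewed (exponential) population, 1,000 samples per sample size, and a fixed seed for reproducibility.

```python
import random
import statistics

random.seed(0)

# Hypothetical skewed population: exponential values with mean close to 1.0
population = [random.expovariate(1.0) for _ in range(20_000)]

centers, spread = {}, {}
for size in (5, 30, 200):
    # Draw 1,000 samples of this size and record each sample's mean
    means = [statistics.fmean(random.sample(population, size)) for _ in range(1_000)]
    centers[size] = statistics.fmean(means)  # stays close to the population mean
    spread[size] = statistics.pstdev(means)  # shrinks as the sample size grows
    print(size, round(centers[size], 3), round(spread[size], 3))
```

The spread of the sample means should fall roughly in proportion to $1/\sqrt{n}$, where $n$ is the sample size.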

# Probability Sampling Methods
* Simple Random Sampling
<br><t>Select by chance.
* Stratified Sampling
<br><t>Divide the population into groups (strata) of similar members, then select randomly from each stratum. Improves accuracy by reducing selection bias.
* Systematic Sampling
<br><t>Select at regular intervals (every nth value)
* Cluster Sampling
<br><t>Divide the population into clusters based on criteria, then randomly select whole clusters. Every member of a selected cluster is included in the sample.
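
The four methods can be sketched with the standard `random` module. The population of 100 records, the `region` field used as both stratum and cluster label, and the function names are all hypothetical.

```python
import random

random.seed(42)

# Hypothetical population: 100 records, 25 per region
regions = ("north", "south", "east", "west")
population = [{"id": i, "region": regions[i % 4]} for i in range(100)]

def simple_random(pop, k):
    """Every member has the same chance of selection."""
    return random.sample(pop, k)

def stratified(pop, key, fraction):
    """Sample the same fraction at random from every stratum."""
    strata = {}
    for row in pop:
        strata.setdefault(row[key], []).append(row)
    sample = []
    for members in strata.values():
        sample.extend(random.sample(members, round(len(members) * fraction)))
    return sample

def systematic(pop, step):
    """Random starting point, then every step-th member."""
    start = random.randrange(step)
    return pop[start::step]

def cluster(pop, key, n_clusters):
    """Pick whole clusters at random and keep all of their members."""
    labels = sorted({row[key] for row in pop})
    chosen = set(random.sample(labels, n_clusters))
    return [row for row in pop if row[key] in chosen]

print(len(simple_random(population, 10)))
print(len(stratified(population, "region", 0.2)))
print(len(systematic(population, 10)))
print(len(cluster(population, "region", 2)))
```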

# Non-probability Sampling Methods
* Convenience Sampling
<br><t>Select based on availability and willingness. Prone to volunteer bias.
* Purposive Sampling
<br><t>Select based on the statistician's judgment of who should participate, e.g.
<br><t>news reporters choosing people whose opinions they want.
* Quota Sampling
<br><t>Select based on predefined proportions until they are met. Unlike stratified sampling, items within each stratum are selected non-randomly (by convenience or judgment).
* Snowball Sampling
<br><t>Select on referral when respondents are difficult to find and trace.
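
As one example from this family, quota sampling can be sketched as accepting respondents in arrival order until every group's quota is filled. The `respondents` list, the `group` field, and the quotas are hypothetical.

```python
def quota_sample(respondents, key, quotas):
    """Accept respondents in the order they arrive until each quota is met."""
    counts = {group: 0 for group in quotas}
    selected = []
    for person in respondents:  # arrival order, no randomization
        group = person[key]
        if counts.get(group, 0) < quotas.get(group, 0):
            selected.append(person)
            counts[group] += 1
        if counts == quotas:    # all quotas filled, stop early
            break
    return selected

# Hypothetical stream of willing respondents
respondents = [
    {"id": 1, "group": "student"},
    {"id": 2, "group": "student"},
    {"id": 3, "group": "employee"},
    {"id": 4, "group": "student"},
    {"id": 5, "group": "retired"},
    {"id": 6, "group": "employee"},
    {"id": 7, "group": "employee"},
]
quotas = {"student": 2, "employee": 2}

picked = quota_sample(respondents, "group", quotas)
print([p["id"] for p in picked])  # [1, 2, 3, 6]
```

Because selection depends on who shows up first, the result reflects the quotas but not a random draw from each group.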