## Data Analysis

**Data Science** is a combination of multiple disciplines that use statistics, data analysis, and machine learning to analyze data and extract knowledge and insights from it. It involves collecting data, cleaning data, performing exploratory data analysis, building and evaluating machine learning models, and communicating insights to stakeholders.    
By using Data Science, companies can make:
- Better decisions (should we choose A or B)
- Predictive analysis (what will happen next?)
- Pattern discoveries (find patterns, or maybe hidden information in the data)


*Data Science* can be applied in nearly every part of a business where data is available.    
- Consumer goods / Stock markets / Industry / Politics / Logistics companies / E-commerce.
- For route planning: to discover the best shipping routes
- To foresee delays for flights, ships, trains, etc. (through predictive analysis)
- To create promotional offers
- To find the best suited time to deliver goods
- To forecast the next year’s revenue for a company
- To predict who will win elections

**Data scientists** explore data, select and build models (machine), tune parameters so that a model fits the observations (learning), then use the model to predict and understand aspects of new, unseen data. They must find patterns within the data. Before finding the patterns, they must organize the data in a standard format.    
Here is how a Data Scientist works:
 - Ask the right questions - To understand the business problem.
 - Explore and collect data - From the database, weblogs, customer feedback, etc.
 - Extract the data - Transform the data to a standardized format.
 - Clean the data - Remove erroneous values from the data.
 - Find and replace missing values - Check for missing values and replace them with a suitable value (e.g., average value).
 - Normalize data - Scale the values to a practical range (e.g., 140 cm is smaller than 1.8 m, yet the number 140 is larger than 1.8, so scaling is important).
 - Analyze data, find patterns and make future predictions.
 - Represent the result - Present the result with useful insights in a way the "company" can understand.

> What is **Data**? Data is a collection of information. One purpose of Data Science is to structure data, making it interpretable and easy to work with.    

Data can be categorized into two groups:
1. Unstructured Data is not organized. We must organize the data for analysis purposes.
2. Structured Data is organized and easier to work with. A database table is an example of structured data.

How to Structure Data? We can use an array or a database table to structure or present data.

> **Machine Learning**: We encounter machine learning models every day. For example, when Netflix recommends a show to you, it uses a model based on what you and other users have watched to predict what you would like. When Amazon chooses a price for an item, it uses a model based on how similar items have sold in the past. When your credit card company calls you because of suspicious activity, it uses a model based on your past activity to recognize anomalous behavior. In Machine Learning, we talk about **supervised** and **unsupervised** *learning*.

> **Supervised Learning** is when we have a known target based on past data (for example, predicting what price a house will sell for). Within supervised learning, there are classification and regression problems. Regression is predicting a numerical value (for example, predicting what price a house will sell for) and classification is predicting categorical value, predicting what class something belongs to (for example, predicting if a borrower will default on their loan).

> **Unsupervised Learning** is when there isn't a known past answer (for example, determining the topics discussed in restaurant reviews). Within unsupervised learning, there are clustering and non-clustering problems.

Some examples of classification problems, where we predict which class something belongs to:
- Predicting who would survive the sinking of the Titanic
- Determining a handwritten digit from an image
- Using biopsy data to classify if a lump is cancerous

A number of popular techniques are used to tackle these problems:
- Logistic Regression
- Decision Trees
- Random Forests
- Neural Networks
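
To make the first technique in the list concrete, here is a minimal sketch of logistic regression with a single feature, trained by plain gradient descent. The data (hours studied versus pass/fail), the function names, the learning rate, and the number of epochs are all illustrative assumptions; in practice a library implementation would normally be used instead.

```python
import math

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit weight w and bias b for one feature with batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y  # predicted probability minus label
            grad_w += error * x
            grad_b += error
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b

def predict(x, w, b):
    """Return class 1 if the predicted probability is at least 0.5, else 0."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0

# Hypothetical toy data: hours studied -> passed (1) or failed (0)
hours = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
passed = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = train_logistic_regression(hours, passed)
print("1 hour  ->", predict(1.0, w, b))
print("4 hours ->", predict(4.0, w, b))
```

The model outputs a probability, and the 0.5 threshold turns it into a class label, which is what makes this a classification technique despite the word "regression" in its name.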


> **Statistical Learning** is concerned with modeling and understanding vast and complex datasets using methods rooted in statistics. Statistics is used in all kinds of science and business applications. It gives us more accurate knowledge, which helps us make better decisions. Statistics can focus on making predictions about what will happen in the future. It can also focus on explaining how different things are connected. Good statistical explanations are also useful for predictions.

*Typical Steps of Statistical Methods:*
- Gathering data
- Describing and visualizing data
- Making conclusions

Knowing which types of data are available can tell you what kinds of questions you can answer with statistical methods. Knowing which questions you want to answer can help guide what sort of data you need. A lot of data might be available, and knowing what to focus on is important.


*How is Statistics Used?* Statistics can be used to understand and make conclusions about the group that you want to know more about. This group is called the population. A population could be many different kinds of groups. It could be:
- All of the people in a country
- All the businesses in an industry
- All the customers of a business
- All people that play football who are older than 45

> **Gathering Data** is the first step in statistical analysis. Say, for example, that you want to know something about all the people in France. The population is then all of the people in France. It is too much effort to gather information about all of the members of a population (e.g. all 67+ million people living in France). It is often much easier to collect a smaller group from that population and analyze that. This is called a sample. The sample needs to be similar to the whole population of France. It should have the same characteristics as the population. If you only include people named Jacques living in Paris who are 48 years old, the sample will not be similar to the whole population. You will need people from all over France, with different ages, professions, and so on. If the members of the sample have similar characteristics (like age, profession, etc.) to the whole population of France, we say that the sample is representative of the population. A good representative sample is crucial for statistical methods.
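
The difference between a representative and a non-representative sample can be sketched with the standard library. The population below is synthetic (random ages, assumed purely for illustration), and the fixed seed only makes the run reproducible.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: ages of 100,000 people (synthetic data)
population = [random.randint(0, 90) for _ in range(100_000)]

# Simple random sample: every member has the same chance of being picked
sample = random.sample(population, k=1_000)

# Biased sample: only 48-year-olds, like the "Jacques" example above
biased_sample = [age for age in population if age == 48]

print("population mean age:   ", round(statistics.mean(population), 1))
print("random sample mean age:", round(statistics.mean(sample), 1))
print("biased sample ages:    ", min(biased_sample), "to", max(biased_sample))
```

The random sample's mean lands close to the population mean, while the biased sample contains a single age and says nothing about how ages vary in the population.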

> **Describing data** is typically the second step of statistical analysis after gathering data. Descriptive Statistics: The information (data) from your sample or population can be visualized with graphs or summarized by numbers. This will show key information in a simpler way than just looking at raw data. It can help us understand how the data is distributed. Graphs can visually show the data distribution. Examples of graphs include histograms, pie charts, bar graphs, and box plots. Some graphs have a close connection to numerical summary statistics. Calculating those gives us the basis of these graphs. For example, a box plot visually shows the quartiles of a data distribution. Quartiles are the values that split the sorted data into four equal-size parts, or quarters. A quartile is one type of summary statistic.

> **Summary statistics** take a large amount of information and sum it up in a few key values. Numbers are calculated from the data which also describe the shape of the distributions. These are individual "statistics". Some important examples are the mean, median, and mode; the range and interquartile range; quartiles and percentiles; and the standard deviation and variance.
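
As a small illustration, the statistics listed above can be computed with Python's built-in `statistics` module. The exam scores are made-up values, and the quartiles use the module's default ("exclusive") method, which is only one of several common conventions.

```python
import statistics

# Hypothetical sample: ten exam scores
scores = [59, 80, 60, 72, 80, 91, 45, 68, 77, 80]

mean = statistics.mean(scores)            # 71.2
median = statistics.median(scores)        # 74.5
mode = statistics.mode(scores)            # 80
value_range = max(scores) - min(scores)   # 46

# Quartiles: the three cut points that split the sorted data into four parts
q1, q2, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1                             # interquartile range

variance = statistics.variance(scores)    # sample variance
std_dev = statistics.stdev(scores)        # sample standard deviation

print(mean, median, mode, value_range)
print(q1, q2, q3, iqr)
print(round(variance, 2), round(std_dev, 2))
```

The second quartile is the median, and the standard deviation is the square root of the variance, so some of these values carry overlapping information.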

> **Data Visualization:** "A picture is worth a thousand words." Data visualization can reveal patterns that are not obvious and communicate insights more effectively. It is part of the broader ability to take data and understand it, process it, extract value from it, visualize it, and communicate it. Design visualization creates graphic images from concepts and ideas, making concepts and options clearer to project owners and partners. Visualization turns the invisible into the visible. It is used in industries such as architecture, engineering, entertainment, and manufacturing.

**Data Mining Tasks:**
- Prediction Methods: Use some variables to predict unknown or future values of other variables.
- Description Methods: Find human-interpretable patterns that describe the data.
- Clustering
- Association Rules
- Anomaly Detection
- Predictive Modeling

> **Classification** is an ordered set of related categories used to group data according to its similarities. The goal of classification is to accurately predict the target class for each case in the data. For example, a classification model could be used to identify loan applicants as low, medium, or high credit risks.

*Examples of Classification Task:*
- Classifying credit card transactions as legitimate or fraudulent
- Classifying land covers (water bodies, urban areas, forests, etc.) using satellite data
- Categorizing news stories as finance, weather, entertainment, sports, etc.
- Identifying intruders in cyberspace
- Predicting tumor cells as benign or malignant

> **Regression** refers to a data mining technique that is used to predict numeric values in a given data set. For example, regression might be used to predict the cost of a product or service, or other variables. It is also used in various industries for modeling business and marketing behavior, trend analysis, and financial forecasting.
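
A minimal sketch of regression is fitting a straight line by ordinary least squares. The floor areas and prices below are invented and deliberately perfectly linear, so the fitted line is easy to check by hand; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for the line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: floor area in square metres -> price in thousands
areas = [50, 60, 80, 100, 120]
prices = [150, 180, 240, 300, 360]  # exactly 3 * area, to keep the check simple

slope, intercept = fit_line(areas, prices)
predicted_price = slope * 90 + intercept  # predict for an unseen 90 m^2 home
print(slope, intercept, predicted_price)
```

Real data would not lie exactly on a line, and the fitted line would then be the one that minimizes the sum of squared vertical distances to the points.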

> **Clustering:** Finding groups of objects such that the objects in a group will be similar (or related) to one another (intra-cluster distances are minimized) and different from (or unrelated to) the objects in other groups (inter-cluster distances are maximized).
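
One well-known way to form such groups is k-means, which is not named in these notes and is used here only as an example. The sketch below runs it on one-dimensional data; the ages and the starting centroids are assumptions chosen so that the result is easy to verify.

```python
def k_means_1d(values, centroids, iterations=10):
    """Lloyd's k-means algorithm on 1-D data, starting from given centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each value joins the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: each centroid moves to the mean of its cluster
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical customer ages with two visible groups
ages = [18, 19, 21, 22, 45, 47, 50, 52]
centroids, clusters = k_means_1d(ages, centroids=[18, 52])
print(centroids)  # [20.0, 48.5]
print(clusters)   # [[18, 19, 21, 22], [45, 47, 50, 52]]
```

Values inside each cluster are close to their centroid (small intra-cluster distance), while the two centroids are far apart (large inter-cluster distance).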

*Association Rule Discovery:* Predicts the occurrence of an item based on occurrences of other items. For example, someone who buys cereal will most probably buy milk as well.
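
The cereal-and-milk rule can be quantified with two standard measures, support and confidence. The sketch below computes both on a handful of invented shopping baskets; the transactions and function names are assumptions for illustration.

```python
# Hypothetical shopping baskets
transactions = [
    {"cereal", "milk", "bread"},
    {"cereal", "milk"},
    {"bread", "butter"},
    {"cereal", "milk", "butter"},
    {"milk", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

print(support({"cereal", "milk"}, transactions))       # 0.6
print(confidence({"cereal"}, {"milk"}, transactions))  # 1.0
print(round(confidence({"milk"}, {"cereal"}, transactions), 2))
```

In this toy data every basket with cereal also contains milk, so the rule "cereal implies milk" has confidence 1.0, while the reverse rule is weaker because one milk basket has no cereal.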

*Deviation/Anomaly/Change Detection:* Detect significant deviations from normal behavior.
Applications:
- Credit Card Fraud Detection
- Network Intrusion Detection
- Identifying anomalous behavior from sensor networks for monitoring and surveillance
- Detecting changes in the global forest cover


**The workflow of any machine learning project** consists of four main parts:

- **Gathering data:** The process of gathering data depends on the project. It can be real-time data or data collected from various sources such as files, databases, and surveys.
- **Data pre-processing:** Usually, the collected data contains a lot of missing values, extremely large values, unorganized text, or noise, and thus cannot be used directly in the model. The data therefore requires some pre-processing before it enters the model.
- **Training and testing the model:** Once the data is ready, it can be fed into the machine learning model. Before that, it is important to have an idea of which model to use, that is, one that is likely to perform well. The data set is divided into three basic sections: the training set, the validation set, and the test set. The main aim is to train the model on the training set, tune the parameters using the validation set, and then test the performance on the test set.
- **Evaluation:** Evaluation is a part of the model development process. It helps to find the model that best represents the data and to estimate how well the chosen model will work in the future. It is done after models have been trained with different algorithms. The main aim is to conclude the evaluation and choose a model accordingly.
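
The three-way split described under "Training and testing the model" can be sketched as follows. The 60/20/20 proportions, the seed, and the function name `train_val_test_split` are assumptions for illustration, not fixed rules.

```python
import random

def train_val_test_split(rows, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the rows, then cut them into train, validation and test sets."""
    rows = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(rows)  # shuffle to avoid any ordering bias
    n_test = int(len(rows) * test_fraction)
    n_val = int(len(rows) * val_fraction)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test

data = list(range(100))  # stand-in for 100 data records
train, val, test = train_val_test_split(data)
print(len(train), len(val), len(test))  # 60 20 20
```

Keeping the test set untouched until the very end is what makes its score an honest estimate of performance on new data.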

### Data

**Data Categories:** By knowing the type of your data, you will be able to know what technique to use when analyzing them.
Different types of data: There are two main types of data: Qualitative (or ‘categorical’) and quantitative (or ‘numerical’). These main types also have different subtypes depending on their measurement level.

**Qualitative Data:** Information about something that can be sorted into different categories that can’t be described directly by numbers. With categorical data, we can calculate statistics like proportions. For example, the proportion of Indian people in the world, or the percent of people who prefer one brand to another.
The other examples of qualitative data are:
 - What language do you speak
 - Favorite holiday destination
 - Opinion on something (agree, disagree, or neutral)
 - Colors
 - Brands
 - Nationality
 - Professions
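
Proportions, the typical statistic for categorical data mentioned above, can be computed by counting categories. The survey answers below are invented for illustration.

```python
from collections import Counter

# Hypothetical survey answers (qualitative data)
answers = ["agree", "neutral", "agree", "disagree",
           "agree", "neutral", "agree", "agree"]

counts = Counter(answers)
proportions = {category: count / len(answers) for category, count in counts.items()}

print(counts.most_common(1))  # [('agree', 5)]
print(proportions)
```

Note that an average of these answers would be meaningless, which is why counts and proportions are used instead.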


**Quantitative Data:** Information about something that is described by numbers. With numerical data, we can calculate statistics like the average income in a country, or the range of heights of players in a football team.

Examples of Quantitative Data: 
 - The height or weight of a person or object
 - Room Temperature
 - Scores and Marks (Ex: 59, 80, 60, etc.)
 - Time
 - Income
 - Age

Quantitative data is further classified into two types:

**Discrete:** Numbers are counted as "whole".    
Example: You cannot have trained 2.5 sessions; it is either 2 or 3.    
Note: binary attributes are a special case of discrete attributes.

Examples of Discrete Data:
 - Total numbers of students present in a class
 - Cost of a cell phone
 - Numbers of employees in a company
 - The total number of players who participated in a competition
 - Days in a week

**Continuous:** Numbers can be of infinite precision. For example, you can sleep for 7 hours, 30 minutes, and 20 seconds, which is about 7.506 hours.

Examples of Continuous Data: 
 - Height of a person
 - Speed of a vehicle
 - Time-taken to finish the work 
 - Wi-Fi Frequency
 - Market share price

> Measurement Levels: Different data types have different measurement levels. The main data types mentioned above are split into the following measurement levels. These measurement levels are also called measurement "scales".

**Nominal Level:** Categories (qualitative data) without any order.
Examples of Nominal Data:
 - Color of hair (Blonde, Red, Brown, Black, etc.)
 - Marital status (Single, Widowed, Married)
 - Nationality (Indian, German, American)
 - Gender (Male, Female, Others)
 - Eye Color (Black, Brown, etc.)

**Ordinal level:** Categories that can be ordered (from low to high), but the precise "distance" between each is not meaningful.
Examples of Ordinal Data:
 - When companies ask for feedback, or satisfaction on a scale of 1 to 10
 - Letter grades in the exam (A, B, C, D, etc.). Consider letter grades from F to A: Is the grade A precisely twice as good as a B? And is the grade B also twice as good as a C? Exactly how much distance there is between grades is not clear and precise. If the grades are based on the number of points on a test, you can say that there is a precise "distance" on the point scale, but not between the grades themselves.
 - Ranking of people in a competition (First, Second, Third, etc.)
 - Economic Status (High, Medium, and Low)
 - Education Level (Higher, Secondary, Primary)

**Interval Level:** Data that can be ordered and the distance between them is objectively meaningful. But there is no natural 0-value where the scale originates.

Examples: Years in a calendar, Temperature measured in Fahrenheit.
Note: Interval scales are usually invented by people, like degrees of temperature.
0 degrees Celsius is 32 degrees Fahrenheit. There are consistent distances between each degree (for every 1 extra degree Celsius, there are 1.8 extra degrees Fahrenheit), but the two scales do not agree on where 0 degrees is.

**Ratio Level:** Data that can be ordered and there is a consistent and meaningful distance between them. And it also has a natural 0-value.
Examples: Money, Age, Time. Data that is on the ratio level (or "ratio scale") gives us the most detailed information. Crucially, we can compare precisely how big one value is compared to another. This would be the ratio between these values, like twice as big, or ten times as small.
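
The practical difference between the interval and ratio levels shows up when units are converted. In the sketch below (values chosen for illustration), a ratio of temperatures changes with the unit, while a ratio of money amounts does not.

```python
def celsius_to_fahrenheit(c):
    """Convert a temperature; note the +32 offset that moves the zero point."""
    return c * 9 / 5 + 32

# Interval scale: temperature
cold_c, warm_c = 10.0, 20.0
cold_f, warm_f = celsius_to_fahrenheit(cold_c), celsius_to_fahrenheit(warm_c)

print(warm_c / cold_c)  # 2.0  -> looks like "twice as warm" in Celsius
print(warm_f / cold_f)  # 1.36 -> but not in Fahrenheit
print(warm_f - cold_f)  # 18.0 -> differences stay consistent (10 * 1.8)

# Ratio scale: money has a natural zero, so ratios survive a unit change
small_eur, large_eur = 10.0, 20.0
small_cents, large_cents = small_eur * 100, large_eur * 100

print(large_eur / small_eur)      # 2.0
print(large_cents / small_cents)  # 2.0
```

Because the temperature conversion shifts the zero point, "twice as warm" has no fixed meaning, whereas "twice as much money" holds in any currency unit.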

Examples of data quality problems:
 - Noise and outliers
 - Wrong/Fake data
 - Missing values
 - Duplicate data

**Noise:** An invalid signal overlapping valid data, or simply wrong data. What causes it? Misspellings, typing mistakes, slang, etc.
Examples: distortion of a person’s voice when talking on a poor phone connection, and “snow” on a television screen.

**Outliers** are data objects with characteristics that are considerably different from most of the other data objects in the data set. Outliers can interfere with data analysis, although in tasks such as anomaly detection they are exactly the values of interest.

Outlier vs Noise: 
Noise is a random error (or a modification of original values) that is not interesting or desirable. Noisy data is meaningless or corrupted data. An “outlier” is a data point or value that differs considerably from all or most other data in a dataset. The data may be clean and properly collected, but the value lies outside the normal range.
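
One common rule of thumb for flagging outliers is the 1.5 × IQR fence, the same rule many box plots use. It is not mentioned in these notes and is shown here only as one possible approach; the heights are invented.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the first/third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical heights in cm; 231 is unusual but could be a real measurement
heights_cm = [158, 162, 165, 167, 170, 171, 173, 176, 180, 231]
print(iqr_outliers(heights_cm))  # [231]
```

The rule only flags a value as unusual. Whether it is noise (a typing error) or a genuine outlier (a very tall person) still has to be decided by looking at how the data was collected.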

**Missing values:** Reasons for missing values:
- Information is not collected (e.g., people decline to give their age and weight).
- Attributes may not be applicable to all cases (e.g., annual income is not applicable to children).

**Handling missing values:**
- Eliminate data objects or variables
- Estimate missing values (for example, in a time series of temperature or in census results)
- Ignore the missing value during the analysis
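
The first two strategies can be sketched in a few lines. Missing values are represented as `None`, and the ages, temperatures, and the helper `fill_single_gaps` are assumptions made for this example.

```python
import statistics

ages = [34, None, 29, 41, None, 36]

# Strategy 1: eliminate the data objects that have a missing value
observed = [a for a in ages if a is not None]

# Strategy 2: estimate missing values, here with the mean of the observed ones
mean_age = statistics.mean(observed)
imputed = [a if a is not None else mean_age for a in ages]
print(imputed)  # [34, 35, 29, 41, 35, 36]

def fill_single_gaps(series):
    """For a time series, replace an isolated gap with its neighbours' average."""
    filled = list(series)
    for i in range(1, len(filled) - 1):
        if filled[i] is None and filled[i - 1] is not None and filled[i + 1] is not None:
            filled[i] = (filled[i - 1] + filled[i + 1]) / 2
    return filled

temperatures = [20.0, None, 22.0, 23.0, None, 21.0]
print(fill_single_gaps(temperatures))  # [20.0, 21.0, 22.0, 23.0, 22.0, 21.0]
```

For a time series, using the neighbouring values usually gives a better estimate than the overall mean, because nearby measurements tend to be similar.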

**Duplicate data:** Data sets may include data objects that are duplicates or almost duplicates of one another. Examples: Same person with multiple email addresses.

**Data cleaning** is a subset of data preprocessing: it deals with data quality issues such as duplicate data. Other data preprocessing techniques include:
Aggregation / Sampling / Dimensionality Reduction / Feature Subset Selection / Feature Creation / Discretization and Binarization / Attribute Transformation.

**Aggregation:** combining two or more attributes (or objects) into a single attribute (or object).    
Purpose:
- Data Reduction: Reduce the number of attributes or objects.
- Change of Scale: Cities aggregated into regions, states, countries, etc.
- More “Stable” Data: Aggregated data tends to have less variability.
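
A small sketch of aggregation as a change of scale: city-level sales rows are combined into one total per region. The regions, cities, and sales figures are invented for illustration.

```python
from collections import defaultdict

# Hypothetical sales records: (region, city, sales)
rows = [
    ("North", "Lille", 120), ("North", "Lille", 80),
    ("North", "Amiens", 40), ("South", "Nice", 200),
    ("South", "Nice", 100), ("South", "Toulon", 60),
]

# Aggregate city-level rows into one total per region
region_totals = defaultdict(int)
for region, _city, sales in rows:
    region_totals[region] += sales

print(dict(region_totals))  # {'North': 240, 'South': 360}
```

Six detailed rows become two aggregated objects, which is the data reduction effect listed above.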

**Sampling** is the main technique employed for data selection. It is often used for both the preliminary investigation of the data and the final data analysis. Statisticians sample because obtaining the entire set of data of interest is too expensive or time-consuming. A sample will work almost as well as the entire data set if the sample is representative.

**Curse of Dimensionality:** When dimensionality (the number of columns) increases, data becomes increasingly sparse/scattered in the space that it occupies. Density and distance become less meaningful.

**Dimensionality Reduction:**    
Purpose:    
- Avoid curse of dimensionality
- Reduce time and memory required
- Allow data to be more easily visualized
- May help to eliminate irrelevant features or reduce noise
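
The claim that distance becomes less meaningful in high dimensions can be observed with random points. The sketch below measures how much farther the farthest point is than the nearest one, relative to the nearest; the uniform random data, the point count, and the name `distance_contrast` are assumptions for illustration.

```python
import math
import random

def distance_contrast(dimensions, n_points=200, seed=0):
    """Return (farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dimensions)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dimensions)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)

for d in (2, 10, 100, 1000):
    print(d, round(distance_contrast(d), 2))
```

As the number of dimensions grows, the contrast shrinks: the nearest and the farthest points end up at almost the same distance, so "nearest neighbour" carries less and less information.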

**Feature Subset Selection:** Another way to reduce the dimensionality of data.

**Redundant Features:** Duplicate much or all of the information contained in one or more other attributes. Example: the purchase price of a product and the amount of sales tax paid.

**Irrelevant Features:** Contain no information that is useful for the data mining task at hand. Example: a student's ID is often irrelevant to the task of predicting the student's GPA.

**Feature Creation:** The original attributes are not always the best representation of the information. Create new features that capture the important information more efficiently.

Difference between Normalization and Standardization (Subset of feature scaling):

**Standardization:**
- Forces data to have a mean of 0 and a standard deviation of 1
- Mean and standard deviation are used for scaling
- It is much less affected by outliers than normalization
- Scikit-Learn provides a transformer called StandardScaler for standardization
- It is often called Z-Score Normalization
- It is useful when the feature distribution is Normal or Gaussian

**Normalization:**
- Minimum and maximum value of features are used for scaling
- It is used when features are of different scales
- Scales values to the range [0, 1] or [-1, 1]
- It is strongly affected by outliers
- Scikit-Learn provides a transformer called MinMaxScaler for Normalization
- It is useful when we don’t know about the distribution
- It is often called Min-Max Scaling
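
Both formulas are short enough to write out by hand. The sketch below is a plain-Python illustration with invented heights, not the Scikit-Learn implementation; it uses the population standard deviation, which as far as I know is also what `StandardScaler` uses.

```python
import statistics

def standardize(values):
    """Z-score: subtract the mean, divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

heights_cm = [140, 150, 160, 170, 180]
print(min_max_normalize(heights_cm))  # [0.0, 0.25, 0.5, 0.75, 1.0]
print([round(z, 2) for z in standardize(heights_cm)])

# Effect of a single outlier on min-max scaling
with_outlier = heights_cm + [1000]
print([round(v, 3) for v in min_max_normalize(with_outlier)])
```

With the outlier added, min-max scaling squeezes the five ordinary heights into a narrow band near 0, which illustrates why normalization is described above as strongly affected by outliers.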
