# Statistics & Probability for Data Science (SP-101)

This course is designed for data science beginners who are interested in developing a strong foundation in statistics and probability. The course covers the fundamental concepts of descriptive and inferential statistics, probability theory, and statistical inference. Students will learn the essential statistical and probabilistic concepts and tools necessary for data analysis and visualization.

## Course Duration:
The course can be completed in approximately 4-5 weeks, with 3-5 hours of study per week.

## Course Requirements:
- **Basic Mathematics:** 10th-class (grade 10) mathematics
- **Basic Programming:** Basics of the Python programming language
- **Statistical Software:** Google Colab, or any other software used for data analysis and visualization

## Course Outcome:
Upon completion of this course, students will be able to:

- Understand the fundamental concepts of statistics and probability theory
- Conduct hypothesis testing, correlation and regression analysis
- Interpret and communicate statistical results
- Understand how to apply statistics and probability concepts to machine learning and artificial intelligence applications



---

# What is Statistics?
Statistics is a branch of mathematics dealing with the collection, analysis, interpretation, presentation, and organization of data. It helps in making informed decisions by quantifying uncertainty and variability, enabling us to understand patterns, trends, and relationships in data. Essential in various fields, statistics employs techniques like sampling, hypothesis testing, and regression for effective problem-solving and decision-making.<br>
<br>
In data science, statistics is crucial for deriving insights, identifying patterns, making predictions, and guiding data-driven decision-making through mathematical techniques.

### The Goal of Studying Statistics
To make sense of numerical information and draw conclusions based on that information.

### Examples of Statistics in Daily Life
- **Opinion Polls:** Opinion polls are a common example of statistics in daily life. Polling agencies use statistical methods to gather and analyze data on public opinions and attitudes towards various issues, such as politics, social trends, and consumer preferences.
- **Sports:** Statistics are commonly used in sports to analyze player performance and team strategies. For example, batting averages and earned run averages are common statistics used in baseball to evaluate players and teams.

- **Medical Research:** Statistics play an important role in medical research. Researchers use statistical methods to analyze data from clinical trials and other studies to determine the effectiveness of treatments, identify risk factors for diseases, and assess the overall health of populations.

- **Weather Forecasting:** Weather forecasting relies heavily on statistical methods to analyze large amounts of data from weather stations, satellites, and other sources. This data is used to create models that predict weather patterns and track severe weather events.

- **Education:** Statistics are used in education to analyze student performance, assess the effectiveness of teaching methods, and evaluate school programs. For example, standardized test scores are used to measure student achievement and identify areas where additional support is needed.

## Fundamentals of Statistics
The fundamentals of statistics involve essential principles and methods that establish a structured approach for gathering, examining, interpreting, and displaying data. These concepts, which include comprehending populations and samples, recognizing variables, determining parameters, and exploring distributions, work together to support well-informed choices and understanding in a wide array of fields.
1. **Population**: A population is the entire group of individuals, objects, or measurements that you want to study or make conclusions about.

2. **Sample**: A sample is a subset of the population that is selected for study. The sample is used to make inferences about the population.

3. **Variable**: A variable is any characteristic or attribute that can be measured or observed and can take on different values.

4. **Parameter**: A parameter is a numerical characteristic of a population. It is a fixed value that describes some aspect of the population, such as its mean, variance, or proportion.

5. **Distribution**: A distribution is a way of describing the pattern of values or observations in a data set. It shows how frequently different values or ranges of values occur in the data.

## Population

### Concept
Population refers to the entire group of individuals, objects, or events that we are interested in studying. This group can be as large or as small as we want, depending on the research question we are trying to answer.

### Examples
For example, if we want to study the health of all people living in a particular city, the population would be all individuals living in that city. This would include people of all ages, genders, races, and ethnicities.

Another example could be studying the average income of all households in a particular state. In this case, the population would be all households in that state, regardless of size or income level.

### Why Is It Important?
Understanding the population is important because it helps us to define the scope of our research and make sure that the conclusions we draw from our study are applicable to the entire group we are interested in, and not just a subset. This is especially important when making decisions that affect large groups of people, such as public policy or marketing campaigns.

## Sample
A sample in statistics refers to a smaller group of individuals, objects, or events that we select from the larger population to study or analyze. 

### Concept
The goal of taking a sample is to draw conclusions about the entire population using a smaller, more manageable dataset.

### Examples
For example, if we want to study the health of all people living in a particular city, it may not be feasible or practical to survey or analyze data from the entire population. Instead, we could take a sample of individuals living in the city, such as a random selection of households or individuals, and analyze their health data. If the sample is selected randomly and is large enough, it is likely to be representative of the larger population, which allows us to draw conclusions about the health of the entire population.

Another example could be studying the average age of all employees at a particular company. Again, it may not be practical to collect data from every employee, so we could take a random sample of employees and analyze their ages. A well-chosen sample is likely to be representative of the entire company, which allows us to estimate the average age of all employees.

### Why Is It Important?
Understanding the concept of sampling is important in statistics because it allows us to draw conclusions about larger populations without having to analyze data from every individual in the population. It also helps us to make informed decisions based on a smaller, more manageable dataset.
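
As a minimal sketch of the employee-age example above, the Python snippet below builds a made-up population of ages, draws a simple random sample with the standard library's `random.sample`, and compares the two means. The data, the seed, the population size, and the sample size are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the example is repeatable

# Hypothetical population: ages of 1,000 employees (invented data).
population_ages = [random.randint(20, 65) for _ in range(1000)]

# Simple random sample of 50 employees, drawn without replacement.
sample_ages = random.sample(population_ages, k=50)

population_mean = statistics.mean(population_ages)
sample_mean = statistics.mean(sample_ages)

print(f"Population size: {len(population_ages)}, mean age: {population_mean:.1f}")
print(f"Sample size:     {len(sample_ages)}, mean age: {sample_mean:.1f}")
```

The two means will usually be close but not identical. That small gap is the price of studying a sample instead of the whole population.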

## Variable
Variables are often used in research studies to help us understand the relationship between different factors or to identify patterns in data.

### Concept
A variable is something that can change or vary. In statistics, a variable is a characteristic or attribute that we can measure or observe in order to learn more about a particular group of people, objects, or events.

Variables can be either numerical or categorical. Numerical variables are those that can be measured or expressed as a number, such as age, height, or weight. Categorical variables are those that represent a characteristic or attribute, such as gender, ethnicity, or favorite color.

### Examples
Example 1: If we want to study the effect of exercise on weight loss, we might use two variables: the amount of exercise someone does (measured in minutes per week) and their weight loss (measured in pounds). By measuring these two variables, we can identify patterns and relationships between exercise and weight loss.

Example 2: Another example could be studying the relationship between income and job satisfaction. In this case, income and job satisfaction are the two variables that we measure, and we would look at how job satisfaction differs across income levels.

Example 3: If we want to study the relationship between studying and grades, we can use two variables: the amount of time someone studies (measured in hours per week) and their grades (measured as an A, B, C, or D). By measuring these two variables, we can identify patterns and relationships between studying and grades.

### Why Is It Important?
The importance of variables in statistics lies in their ability to help us understand relationships between different factors. By identifying and measuring variables, we can gain valuable insights into how different characteristics or attributes relate to each other. This information can help us make decisions and solve problems in various areas, such as healthcare, business, and education.
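
To make the exercise and weight-loss example (Example 1) concrete, here is a small Python sketch. Each record is one person and each key is one variable. The records, the key names, and the 100-minute cutoff are hypothetical, chosen only to show how variables are stored and compared.

```python
import statistics

# Hypothetical study records (invented data): one dict per person,
# one key per variable.
participants = [
    {"exercise_min_per_week": 30, "weight_loss_lb": 0.5},
    {"exercise_min_per_week": 60, "weight_loss_lb": 1.0},
    {"exercise_min_per_week": 90, "weight_loss_lb": 2.0},
    {"exercise_min_per_week": 120, "weight_loss_lb": 2.5},
    {"exercise_min_per_week": 150, "weight_loss_lb": 3.5},
    {"exercise_min_per_week": 180, "weight_loss_lb": 4.0},
]

# Pull each variable out as its own list of values.
exercise = [p["exercise_min_per_week"] for p in participants]
weight_loss = [p["weight_loss_lb"] for p in participants]

# Compare average weight loss for people below and above 100 minutes per week.
low = [p["weight_loss_lb"] for p in participants if p["exercise_min_per_week"] < 100]
high = [p["weight_loss_lb"] for p in participants if p["exercise_min_per_week"] >= 100]

mean_loss_low = statistics.mean(low)
mean_loss_high = statistics.mean(high)

print(f"Average loss, under 100 min/week:    {mean_loss_low:.2f} lb")
print(f"Average loss, 100 min/week or more:  {mean_loss_high:.2f} lb")
```

In this invented data the group that exercises more loses more weight on average, which is the kind of pattern that measuring two variables lets us look for.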

## Parameter
A parameter in statistics is a numerical characteristic of a population that we are interested in studying. Parameters are used to describe some aspect of the population, such as its mean, variance, or proportion.

### Concept
A parameter is a number that tells us something about a whole group of things or people. For example, let's say we want to know how tall all the kids in a school are. We can't measure every single kid, so we might measure a smaller group of kids called a "sample". Based on that sample, we can estimate a parameter - in this case, the average height of all the kids in the school.

A parameter is different from a variable because a variable is something we can measure or observe about each individual in the sample. In our example, the variable is the height of each individual kid. By measuring this variable in a sample, we can estimate the parameter of the average height of all the kids in the school.

### Examples
For example, in a study of the average height of all people living in a city, height would be the variable that we measure in our sample of individuals. Based on the data from the sample, we can estimate the population parameter of the mean height.

### Why Is It Important?
Overall, parameters are important in statistics because they allow us to draw conclusions about the entire population based on a sample of data. By estimating population parameters, we can make predictions or draw conclusions about the entire population based on the information we have gathered from the sample.
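
The sketch below illustrates the school-height example in Python. It assumes, purely for illustration, that we do have the height of every kid in the school, so the parameters (mean, variance, and a proportion) can be computed directly, and then shows how a sample gives an estimate of the mean. The data is simulated, and the 150 cm threshold and sample size are arbitrary assumptions.

```python
import random
import statistics

random.seed(0)  # fixed seed so the example is repeatable

# Hypothetical population: heights in cm of all 500 kids in a school
# (simulated data, not real measurements).
heights_cm = [random.gauss(140, 10) for _ in range(500)]

# Parameters describe the whole population.
mean_height = statistics.mean(heights_cm)
variance_height = statistics.pvariance(heights_cm)  # population variance
proportion_tall = sum(h > 150 for h in heights_cm) / len(heights_cm)

# In practice we usually only have a sample, so we estimate the parameter.
sample = random.sample(heights_cm, k=40)
estimated_mean = statistics.mean(sample)

print(f"Population mean (parameter):      {mean_height:.1f} cm")
print(f"Population variance (parameter):  {variance_height:.1f}")
print(f"Proportion taller than 150 cm:    {proportion_tall:.2f}")
print(f"Mean estimated from 40 kids:      {estimated_mean:.1f} cm")
```

Here height is the variable (one value per kid), while the mean, variance, and proportion are parameters (one value for the whole school).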

### How Are Variables and Parameters Different?
A parameter is like a number that tells us something about a whole group of things or people. It's a way of summarizing or describing that group. For example, if we wanted to know the average height of all the kids in a school, the average height would be the parameter.

A variable, on the other hand, is something that can vary or change from person to person or thing to thing. For example, if we were measuring the height of each individual kid, that would be the variable.

So, in summary, a parameter is a summary number that describes a whole group of things or people, while a variable is something that can vary or change from person to person or thing to thing.

## Distribution
A distribution in statistics is a way of showing how the values of a variable are spread out or distributed among a group of individuals or objects. 

### Concept
A distribution can be shown visually using graphs, such as histograms or frequency polygons. These graphs can show the frequency of each value in the data and how the values are spread out.

### Examples
1. Grade distribution: If we were looking at the grades of all students in a class, the distribution would show how many students received each grade (e.g. A, B, C, D, or F).

2. Income distribution: If we were looking at the incomes of all people in a city, the distribution would show how many people earn each amount of money (e.g. $0-$25,000, $25,001-$50,000, $50,001-$75,000, etc.).

3. Age distribution: If we were looking at the ages of all people in a country, the distribution would show how many people are in each age group (e.g. 0-4, 5-9, 10-14, etc.).

4. Test score distribution: If we were looking at the test scores of all students in a school, the distribution would show how many students received each score (e.g. 0-20%, 21-40%, 41-60%, etc.).

5. Height distribution: If we were looking at the heights of all people in a country, the distribution would show how many people are of each height (e.g. 4'6"-4'11", 5'0"-5'5", 5'6"-5'11", etc.).

### Why Is It Important?
Understanding the distribution of a variable is important in statistics because it allows us to identify patterns and trends in the data, which can help us make informed decisions based on the information we have.
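
As a rough sketch of the grade distribution example, the Python snippet below counts how many students received each grade and prints a simple text histogram. The list of grades is invented for illustration.

```python
from collections import Counter

# Hypothetical letter grades for a class of 20 students (invented data).
grades = ["A", "B", "B", "C", "A", "D", "B", "C", "C", "B",
          "F", "B", "A", "C", "B", "D", "C", "B", "A", "C"]

# Count how often each grade occurs.
grade_counts = Counter(grades)

# Print one row per grade, with one '#' per student.
for grade in ["A", "B", "C", "D", "F"]:
    count = grade_counts[grade]
    print(f"{grade}: {'#' * count} ({count})")
```

The row lengths show the shape of the distribution at a glance, which is the same idea a histogram expresses graphically.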

## Types of Variables
Variables can be classified into two main categories: categorical and numeric.

These are further classified into four subcategories: nominal (categorical), ordinal (categorical), discrete (numeric), and continuous (numeric). Understanding the type of variable is important because it can affect the type of statistical analysis that can be used and how the data should be interpreted.






### Categorical
Categorical variables are a type of variable in statistics that represent characteristics or attributes that can be placed into categories. These categories are named rather than measured numerically, and they may or may not have a natural order.

For example, a categorical variable could be gender, where the categories are male and female. Another example is color, where the categories could be red, blue, green, and so on.

Categorical variables can be further divided into two types: nominal and ordinal. 

#### Nominal
A nominal variable is a type of categorical variable that describes a characteristic or attribute without any order or ranking. In other words, there is no specific order or hierarchy to the values of the variable.

For example, hair color is a nominal variable because there is no inherent order or ranking to the different colors. A person with red hair is not necessarily "better" or "higher" than a person with blonde hair. Other examples of nominal variables might include:

- Favorite color: There is no inherent order or ranking to different colors.
- Animal species: Different types of animals do not have an inherent order or ranking.
- Gender: There is no inherent order or ranking to the different genders.

#### Ordinal
An ordinal variable is a type of categorical variable that describes a characteristic or attribute with a natural order or ranking. In other words, there is a specific order or hierarchy to the values of the variable.

For example, sizes of clothing (small, medium, large) are an ordinal variable because there is a natural order or ranking to the sizes. A medium-sized shirt is "higher" or "bigger" than a small-sized shirt, but not as big as a large-sized shirt. Other examples of ordinal variables might include:

- Education level: Different levels of education (high school diploma, associate's degree, bachelor's degree, etc.) have a natural order or ranking.
- Income brackets: Different levels of income (low, middle, high) have a natural order or ranking.
- Grades: Different letter grades (A, B, C, D, F) have a natural order or ranking.
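
The difference between nominal and ordinal variables can be sketched in Python as follows. The data is invented, and the `size_order` mapping is a hypothetical helper that makes the natural order of clothing sizes explicit.

```python
from collections import Counter

# Nominal: categories with no order. Counting is meaningful, ranking is not.
hair_colors = ["brown", "black", "blonde", "brown", "red", "black", "brown"]
most_common_color, _ = Counter(hair_colors).most_common(1)[0]

# Ordinal: categories with a natural order, written out here as ranks.
size_order = {"small": 0, "medium": 1, "large": 2}
shirt_sizes = ["large", "small", "medium", "small", "large"]

# Sorting and taking a maximum only make sense because the order exists.
sorted_sizes = sorted(shirt_sizes, key=size_order.get)
largest_size = max(shirt_sizes, key=size_order.get)

print("Most common hair color:", most_common_color)
print("Sizes in order:", sorted_sizes)
print("Largest size:", largest_size)
```

For the nominal variable we can only ask which category occurs most often. For the ordinal variable we can also sort the values and ask which is largest.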




### Numerical
Numeric variables are a type of variable that represent numerical values. These values can be manipulated using mathematical operations.

For example, let's consider two numeric variables, weight and height. We can add, subtract, multiply, and divide these values to perform various calculations, such as finding the average height of a group or the difference in weight between two people.

Numeric variables can be further divided into two types: discrete and continuous.

#### Discrete
A discrete variable is a type of numeric variable that takes on specific, individual values. In other words, there are only certain possible values that the variable can have.

For example, the number of siblings a person has is a discrete variable because it can only take on specific integer values (0, 1, 2, 3, etc.). Other examples of discrete variables might include:

- The number of pets a person owns: This can only take on specific integer values (0, 1, 2, 3, etc.).
- The number of students in a classroom: This can only take on specific integer values (0, 1, 2, 3, etc.).
- The number of books on a shelf: This can only take on specific integer values (0, 1, 2, 3, etc.).

#### Continuous
A continuous variable is a type of numeric variable that can take on any value within a certain range. This means that there are an infinite number of possible values that the variable can have.

For example, let's say we're talking about the height of a person. A person can be any height between a certain minimum and maximum value. So, someone can be 1.5 meters tall, or 1.6 meters tall, or 1.59999999999 meters tall. There are an infinite number of possible values for height because someone can be any height within a certain range.

Other examples of continuous variables might include:

- Temperature: The temperature can be any value between a certain minimum and maximum value. So, it can be 22 degrees Celsius, or 22.5 degrees Celsius, or 22.5555 degrees Celsius. There are an infinite number of possible values for temperature because it can be any value within a certain range.
- Time: Time can be any value between a certain minimum and maximum value. So, it can be 3:00 PM, or 3:01 PM, or 3:01:30 PM. There are an infinite number of possible values for time because it can be any value within a certain range.
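
The contrast between discrete and continuous variables can be sketched in Python. The values below are invented for illustration. Discrete counts are stored as whole numbers, while continuous measurements are stored as decimals.

```python
# Discrete: counts that can only be whole numbers.
siblings = [0, 2, 1, 3, 1, 0, 2]

# Continuous: measurements that can take any value in a range (heights in meters).
heights_m = [1.5, 1.6, 1.59999999999, 1.72, 1.655]

# Every sibling count is a whole number, and only a few distinct values occur.
all_whole = all(isinstance(n, int) for n in siblings)
distinct_sibling_values = sorted(set(siblings))

# Between any two heights there is always another possible height.
midpoint = (heights_m[0] + heights_m[1]) / 2

print("All sibling counts are whole numbers:", all_whole)
print("Distinct sibling counts:", distinct_sibling_values)
print("A height between 1.5 m and 1.6 m:", midpoint)
```

A person cannot have 1.5 siblings, but a height halfway between 1.5 m and 1.6 m is perfectly possible.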

## Test Yourself
---

### 1. Which of the following is an example of statistics in daily life?
- a) Baking a cake at home
- b) Watching a movie at the theater
- c) Conducting an opinion poll
- d) Going on a hike in the mountains

Answer: c) Conducting an opinion poll.

>Explanation: Opinion polls are a common example of statistics in daily life, as polling agencies use statistical methods to gather and analyze data on public opinions and attitudes towards various issues.



### 2. What role do statistics play in medical research?
- a) Identifying the best hiking trails
- b) Evaluating the effectiveness of treatments
- c) Developing new recipes for cooking
- d) Planning travel itineraries

Answer: b) Evaluating the effectiveness of treatments.

>Explanation: Statistics play an important role in medical research, as researchers use statistical methods to analyze data from clinical trials and other studies to determine the effectiveness of treatments, identify risk factors for diseases, and assess the overall health of populations.

### 3. What is a population in statistics?
- a) A sample of individuals, objects, or events that we select for study
- b) A subset of a larger population
- c) The entire group of individuals, objects, or measurements that you want to study or make conclusions about
- d) A numerical characteristic of a population

Answer: c) The entire group of individuals, objects, or measurements that you want to study or make conclusions about.

>Explanation: Population refers to the entire group of individuals, objects, or events that we are interested in studying.

### 4. What is the purpose of taking a sample in statistics?
- a) To draw conclusions about the entire population using a smaller dataset
- b) To measure or observe a characteristic or attribute that can take on different values
- c) To describe the pattern of values or observations in a data set
- d) To estimate a numerical characteristic of a population

Answer: a) To draw conclusions about the entire population using a smaller dataset.

> Explanation: The goal of taking a sample is to draw conclusions about the entire population using a smaller, more manageable dataset.

### 5. Which of the following is an example of a nominal variable?
- a) Height
- b) Weight
- c) Income
- d) Eye color

Answer: d) Eye color.

>Explanation: Eye color is a nominal variable because there is no inherent order or ranking to the different colors.

### 6. Which of the following is an example of an ordinal variable?
- a) Favorite color
- b) Number of siblings
- c) Education level
- d) Temperature

Answer: c) Education level.

>Explanation: Education level is an ordinal variable because different levels of education (high school diploma, associate's degree, bachelor's degree, etc.) have a natural order or ranking.

### 7. Which of the following is an example of a discrete variable?
- a) Time
- b) Height
- c) Temperature
- d) Number of pets

Answer: d) Number of pets.

> Explanation: The number of pets a person owns is a discrete variable because it can only take on specific integer values (0, 1, 2, 3, etc.).

### 8. Which of the following is an example of a continuous variable?
- a) Number of siblings
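- b) Weight
- c) Number of books on a shelf
- d) Favorite color

Answer: b) Weight.

> Explanation: Weight is a continuous variable because it can take on any value within a certain range. The number of siblings and the number of books on a shelf are discrete, and favorite color is a nominal variable.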

### 9. What is a parameter in statistics?
- a) A way of showing how the values of a variable are spread out or distributed among a group of individuals or objects
- b) A type of variable that represents characteristics or attributes that can be placed into categories
- c) A numerical characteristic of a population
- d) A subset of the population that is selected for study

Answer: c) A numerical characteristic of a population.

>Explanation: A parameter is a numerical characteristic of a population. It is a fixed value that describes some aspect of the population, such as its mean, variance, or proportion.

### 10. Why is understanding the distribution of a variable important in statistics?
- a) It helps us identify patterns and trends in the data
- b) It allows us to draw conclusions about the entire population based on a sample of data
- c) It helps us to make informed decisions based on a smaller, more manageable dataset
- d) It allows us to measure or observe a characteristic or attribute that can take on different values

Answer: a) It helps us identify patterns and trends in the data.

>Explanation: Understanding the distribution of a variable is important in statistics because it allows us to identify patterns and trends in the data, which can help us make informed decisions based on the information we have.

### 11. Which of the following is a numerical variable?
- a) Gender
- b) Favorite color
- c) Height
- d) Education level

Answer: c) Height. Numerical variables are those that represent numerical values, such as height, weight, or age.

>Explanation: Gender (a) and favorite color (b) are both nominal categorical variables, while education level (d) is an ordinal categorical variable because its categories have a natural order. Height (c) is a numerical variable because it represents a value that can be measured and manipulated mathematically.

### 12. What is the difference between a parameter and a variable?
- a) A parameter describes a numerical characteristic of a population, while a variable is any characteristic or attribute that can be measured or observed.
- b) A parameter is a type of variable that can take on numerical values, while a variable is a summary number that describes a population.
- c) A parameter is a variable that can be observed or measured, while a variable is a numerical characteristic of a population.
- d) There is no difference between a parameter and a variable.

Answer: a) A parameter describes a numerical characteristic of a population, while a variable is any characteristic or attribute that can be measured or observed.

>Explanation: A parameter is a numerical characteristic of a population, such as its mean, variance, or proportion, while a variable is any characteristic or attribute that can be measured or observed, such as height, weight, or income.

### 13. Which of the following is an example of a nominal variable?
- a) Age
- b) Education level
- c) Height
- d) Eye color

Answer: d) Eye color. Nominal variables are a type of categorical variable that describes a characteristic or attribute without any order or ranking, such as eye color, favorite color, or animal species.

>Explanation: Age (a) and height (c) are numerical variables, and education level (b) is an ordinal variable because its categories have a natural order, so none of them is nominal. Eye color (d) is a nominal variable because there is no inherent order or ranking to the different colors.

### 14. What is the difference between a discrete and a continuous variable?
- a) A discrete variable is a variable that can take on any value within a certain range, while a continuous variable takes on specific, individual values.
- b) A discrete variable is a type of categorical variable, while a continuous variable is a numerical variable.
- c) A discrete variable takes on specific, individual values, while a continuous variable can take on any value within a certain range.
- d) There is no difference between a discrete and a continuous variable.

Answer: c) A discrete variable takes on specific, individual values, while a continuous variable can take on any value within a certain range.

> Explanation: Discrete variables are a type of numerical variable that takes on specific, individual values, such as the number of siblings a person has or the number of pets they own. Continuous variables, on the other hand, can take on any value within a certain range, such as height, weight, or temperature.

### 15. What is a distribution in statistics?
- a) A way of showing how the values of a variable are spread out or distributed among a group of individuals or objects
- b) A way of measuring the relationship between two variables
- c) A way of defining a population
- d) A way of selecting a sample

Answer: a) A way of showing how the values of a variable are spread out or distributed among a group of individuals or objects. 

>Explanation: A distribution in statistics is a way of describing the pattern of values or observations in a data set. It shows how frequently different values or ranges of values occur in the data.