### 1. What are the key tasks involved in getting ready to work with machine learning modeling?

There are several key tasks involved in getting ready to work with machine learning modeling:

1. **Define the problem:** The first step is to define the problem that you want to solve with machine learning. Clearly defining the problem will help you determine the type of data you need and the type of algorithm you should use.

2. **Gather data:** Once you have defined the problem, the next step is to gather the necessary data. You need to ensure that the data you collect is accurate, relevant, and representative of the problem you are trying to solve.

3. **Data cleaning and preprocessing:** Before you can use the data for modeling, you need to clean and preprocess it. This includes tasks such as handling missing values (dropping or imputing them), handling outliers, scaling the data, and encoding categorical variables.

4. **Feature engineering:** Feature engineering involves selecting and transforming the relevant features in the data to improve the performance of the model. This may involve creating new features, selecting the most important features, and transforming the features to better suit the model.

5. **Model selection:** There are various machine learning models to choose from, and you need to select the one that is best suited to your problem. This involves understanding the strengths and weaknesses of different models and selecting the one that best meets your needs.

6. **Model training:** Once you have selected the model, you need to train it on the data. This involves feeding the model with the input data and adjusting the model's parameters to minimize the error.

7. **Model evaluation:** After training the model, you need to evaluate its performance to see how well it is doing. This involves testing the model on a separate test dataset and comparing its predicted output with the actual output.

8. **Model deployment:** Finally, you need to deploy the model in a production environment, where it can be used to make predictions on new data. This involves integrating the model with other software and ensuring that it is scalable and reliable.
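
The steps above can be sketched end to end on a toy problem. This is only a minimal illustration, not a recommended pipeline: the data values, the choice of a straight-line model, the 75/25 split, and the helper names `fit_line` and `mean_absolute_error` are all assumptions made for the example.

```python
import random

# Hypothetical toy data: (house size in m^2, price in thousands).
# None marks a missing value.
raw = [(50, 150.0), (60, 182.0), (None, 200.0), (80, 239.0), (90, 271.0),
       (100, 300.0), (110, 331.0), (120, None), (130, 389.0), (140, 421.0)]

# Step 3 (cleaning): drop rows that contain a missing value.
clean = [(x, y) for x, y in raw if x is not None and y is not None]

# Hold out a test set so evaluation uses data the model has not seen.
random.seed(0)
random.shuffle(clean)
split = int(0.75 * len(clean))
train, test = clean[:split], clean[split:]


def fit_line(points):
    """Steps 5-6: fit y = slope * x + intercept by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_absolute_error(points, slope, intercept):
    """Step 7: average absolute gap between predicted and actual values."""
    return sum(abs(slope * x + intercept - y) for x, y in points) / len(points)


slope, intercept = fit_line(train)
mae = mean_absolute_error(test, slope, intercept)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, test MAE={mae:.2f}")
```

Evaluating on `test` rather than `train` is the important design choice here: it estimates how the model behaves on new data, which is what matters once the model is deployed.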

### 2. What are the different forms of data used in machine learning? Give a specific example for each of them.

**There are three main forms of data used in machine learning:**

**Numeric data:** Numeric data is represented by numbers and can be either continuous or discrete. Examples of continuous numeric data include height, weight, and temperature, while examples of discrete numeric data include the number of people in a household, the number of cars in a parking lot, and the number of items in a shopping cart.

**Categorical data:** Categorical data is represented by labels or categories, and can be nominal or ordinal. Nominal categorical data has no natural order or ranking, such as colors or breeds of dogs. Ordinal categorical data has a natural order or ranking, such as education level (elementary, high school, college).

**Text data:** Text data is represented by natural language text, such as emails, social media posts, or product reviews. Text data can be transformed into numerical form through techniques such as tokenization, stemming, and vectorization, so that it can be used in machine learning models.

**Examples of each type of data:**

**Numeric data:** A machine learning model to predict housing prices might use numeric data such as the number of bedrooms, square footage, and age of the house.

**Categorical data:** A machine learning model to predict customer churn might use categorical data such as the customer's gender, location, and subscription plan.

**Text data:** A machine learning model to classify movie reviews as positive or negative might use text data from the reviews themselves. For example, the model might analyze the sentiment of the words used in the review to make its prediction.
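
As a rough sketch of how each form of data might be turned into numbers a model can consume, the example below uses plain Python. The values, the category lists, the vocabulary, and the helper names `one_hot` and `bag_of_words` are hypothetical, and real projects would normally rely on a library for these steps.

```python
# Numeric data can be used as-is, often after scaling.
square_footage = [850, 1200, 1500]
lo, hi = min(square_footage), max(square_footage)
scaled = [(v - lo) / (hi - lo) for v in square_footage]  # min-max scaling to [0, 1]


def one_hot(value, categories):
    """Nominal categories have no order, so give each one its own 0/1 slot."""
    return [1 if value == c else 0 for c in categories]


plans = ["basic", "premium", "family"]  # hypothetical subscription plans
encoded_plan = one_hot("premium", plans)

# Ordinal categories have an order, so map them to increasing integers.
education_rank = {"elementary": 0, "high school": 1, "college": 2}
encoded_education = education_rank["high school"]


def bag_of_words(text, vocabulary):
    """Text: lowercase, split into tokens, then count each vocabulary word."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocabulary]


vocabulary = ["great", "boring", "plot"]  # hypothetical vocabulary
review_vector = bag_of_words("Great acting and a great plot", vocabulary)

print(scaled, encoded_plan, encoded_education, review_vector)
```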

### 3. Distinguish:

1. Numeric vs. categorical attributes

2. Feature selection vs. dimensionality reduction

The following are the differences between:

1. **Numeric vs. categorical attributes:**
* Numerical data are values of a quantitative variable and carry a sense of magnitude in the context of that variable (hence they are always numbers or symbols carrying a numerical value).
* Categorical data are values of a qualitative variable. Even when categories are coded as numbers, those numbers carry no sense of magnitude.
* Numerical data belong to the interval or ratio type, whereas categorical data belong to the nominal or ordinal type. The methods used to analyse the two also differ: even where the principles are the same, the application differs significantly.
* Numerical data are analysed with descriptive statistics, regression, time series analysis, and many other statistical methods. Categorical data are usually analysed with descriptive and graphical methods, along with some non-parametric tests.
2. **Feature selection vs. dimensionality reduction:**
* Feature selection chooses a subset of the original features without transforming the data in any way.
* Dimensionality reduction, on the other hand, typically finds a new representation that describes most, but not all, of the variance in the data, retaining the relevant information while reducing the amount of information needed to represent it.
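
The contrast between the two can be illustrated with a small sketch. The data set, the variance threshold, and the helper names `select_by_variance` and `combine` are hypothetical, and the weights passed to `combine` are arbitrary placeholders standing in for what a method such as PCA would actually learn.

```python
from statistics import pvariance

# Hypothetical data set: each key is a feature, each list holds one value per sample.
features = {
    "age":       [23, 35, 47, 52, 61],
    "income_k":  [30, 42, 58, 61, 75],
    "is_member": [1, 1, 1, 1, 1],  # constant, so it carries no information
}


def select_by_variance(columns, threshold=0.0):
    """Feature selection: keep the ORIGINAL columns whose variance exceeds the threshold."""
    return {name: values for name, values in columns.items()
            if pvariance(values) > threshold}


def combine(columns, weights):
    """Dimensionality reduction (illustrative): build ONE new feature as a weighted sum."""
    names = list(weights)
    n = len(columns[names[0]])
    return [sum(weights[name] * columns[name][i] for name in names) for i in range(n)]


selected = select_by_variance(features)
reduced = combine(selected, {"age": 0.5, "income_k": 0.5})

print(list(selected))  # surviving columns are unchanged originals
print(reduced)         # a new column that exists in neither original feature
```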

### 4. Make quick notes on any two of the following:

1. The histogram

2. Use a scatter plot

3. PCA

Quick notes on the three topics:

**The histogram:** A histogram is a graphical representation that organizes a group of data points into user-specified ranges. Similar in appearance to a bar graph, the histogram condenses a data series into an easily interpreted visual by taking many data points and grouping them into logical ranges, or bins.

**Use a scatter plot:** A scatter plot (aka scatter chart, scatter graph) uses dots to represent values for two different numeric variables. The position of each dot on the horizontal and vertical axis indicates values for an individual data point. Scatter plots are used to observe relationships between variables.

**PCA:** Principal Component Analysis (PCA) is a widely used technique for reducing the dimensionality of large data sets. Reducing the number of components or features costs some accuracy, but it makes a large data set simpler and easier to explore and visualize.
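
To make the PCA note concrete, here is a minimal sketch for the special case of two features, where the eigenvalues of the 2x2 covariance matrix have a closed form. The function name `pca_2d` and the sample data are assumptions for illustration, and the sketch assumes the data are not all identical; general-purpose PCA would use a linear algebra library instead.

```python
import math


def pca_2d(xs, ys):
    """Return (unit direction of the first principal component,
    share of variance it explains, 1-D scores of each point)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]

    # Sample covariance matrix [[a, b], [b, c]].
    a = sum(d * d for d in dx) / (n - 1)
    c = sum(d * d for d in dy) / (n - 1)
    b = sum(p * q for p, q in zip(dx, dy)) / (n - 1)

    # Largest eigenvalue of a symmetric 2x2 matrix, in closed form.
    lam1 = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)

    # Matching eigenvector; fall back to an axis when the features are uncorrelated.
    if b != 0:
        vx, vy = b, lam1 - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm

    # Project the centered points onto the first component (2 features -> 1).
    scores = [p * vx + q * vy for p, q in zip(dx, dy)]
    return (vx, vy), lam1 / (a + c), scores


# Points lying exactly on the line y = 2x: one component captures everything.
direction, explained, scores = pca_2d([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(direction, explained)
```

Because these sample points lie on a single line, the first component explains all of the variance; with noisier data the explained share drops below 1, which is the accuracy cost mentioned above.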

### 5. Why is it necessary to investigate data? Is there a discrepancy in how qualitative and quantitative data are explored?

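If a data set is messy, building models on it will not solve the problem: garbage in, garbage out. To build a useful machine learning model, we first need to explore and understand the data set, and only then define a predictive task and solve it. Qualitative and quantitative data are explored differently: numeric attributes are summarized with descriptive statistics such as the mean, median, and spread and plotted with histograms, while categorical attributes are summarized with frequency counts and the mode and plotted with bar charts.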

### 6. What are the various histogram shapes? What exactly are 'bins'?

The common types of histogram are:

1. Uniform Histogram
2. Symmetric Histogram
3. Bimodal Histogram
4. Probability Histogram

A bin is one of the consecutive, non-overlapping intervals into which the range of the data on the x-axis is divided; the choice of bin width sets the unit and spacing of that axis. Every data point falls into exactly one bin, and the height of each bar shows how often values occur within that bin's range.
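
A minimal sketch of equal-width binning follows, assuming the values are not all identical. The function name `histogram` and the sample values are made up for illustration.

```python
def histogram(values, n_bins):
    """Group values into n_bins equal-width bins; return (bin_edges, counts)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]
    counts = [0] * n_bins
    for v in values:
        # The last bin is closed on the right so the maximum value is counted.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return edges, counts


values = [1, 2, 2, 3, 3, 3, 4, 4, 5]
edges, counts = histogram(values, n_bins=4)

# Text rendering: one row per bin, one '#' per data point in that bin.
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {'#' * count}")
```

Changing `n_bins` changes the picture: too few bins hide the shape of the distribution, while too many leave most bins nearly empty.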

### 7. How do we deal with data outliers?

We can use the Z-score or any of the methods below to deal with outliers:

**Univariate Method:** This method looks for data points with extreme values on one variable.

**Multivariate Method:** Here, we look for unusual combinations of all the variables.

**Minkowski Error:** This method reduces the contribution of potential outliers in the training process.

**Z-Score:** Once the absolute Z-score of every value has been computed (assumed here to be stored in `z`), a single line keeps only the rows in which every column has a Z-score below 3:

    boston_df_o = boston_df_o[(z < 3).all(axis=1)]

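**IQR Score:** Compute the interquartile range (IQR = Q3 - Q1, where `Q1` and `Q3` are the 25th and 75th percentiles of each column, assumed to be computed already) and keep only the rows in which no value lies more than 1.5 × IQR outside the quartiles:

    # Flag values lying more than 1.5 * IQR below Q1 or above Q3
    outlier_mask = (boston_df_o1 < (Q1 - 1.5 * IQR)) | (boston_df_o1 > (Q3 + 1.5 * IQR))
    boston_df_out = boston_df_o1[~outlier_mask.any(axis=1)]
    boston_df_out.shape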

**Quantile function:** Use `quantile()` to find cut-off values, then drop the rows that fall beyond them (for example, values above a chosen upper quantile).
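
The Z-score and IQR filters can also be tried on a single list of numbers without pandas. The sketch below uses only the standard library; the sample values and the helper names `remove_by_zscore` and `remove_by_iqr` are hypothetical, while the thresholds of 3 and 1.5 mirror the ones used above.

```python
from statistics import mean, pstdev, quantiles

values = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
          12, 10, 11, 13, 12, 11, 12, 10, 13, 95]  # 95 is a planted outlier


def remove_by_zscore(data, threshold=3.0):
    """Keep values whose absolute Z-score is below the threshold."""
    mu, sigma = mean(data), pstdev(data)
    return [v for v in data if abs(v - mu) / sigma < threshold]


def remove_by_iqr(data, k=1.5):
    """Keep values within k * IQR of the first and third quartiles."""
    # method="inclusive" treats the sample minimum and maximum
    # as the 0th and 100th percentiles.
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return [v for v in data if q1 - k * iqr <= v <= q3 + k * iqr]


print(len(values), len(remove_by_zscore(values)), len(remove_by_iqr(values)))
```

One caveat worth knowing: in very small samples a single extreme value inflates the standard deviation so much that its own Z-score may never reach 3, which is one reason the IQR rule is often preferred for short lists.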

### 8. What are the various central inclination measures? Why does mean vary too much from median in certain data sets?

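Mean, median, and mode are the measures of central inclination (central tendency). The mean differs strongly from the median when the data contain outliers or are heavily skewed: the mean averages every value, so a few extreme values pull it toward them, whereas the median is simply the middle value of the sorted data and is barely affected.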
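
A short illustration of this effect, using made-up salary figures and Python's built-in `statistics` module:

```python
from statistics import mean, median, mode

salaries = [30, 32, 35, 35, 38]   # hypothetical salaries, in thousands
with_outlier = salaries + [400]   # the same data plus one extreme value

print(mean(salaries), median(salaries), mode(salaries))
print(mean(with_outlier), median(with_outlier), mode(with_outlier))
```

Adding the single extreme value moves the mean from 34 to 95, while the median and mode both stay at 35.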

### 9. Describe how a scatter plot can be used to investigate bivariate relationships. Is it possible to find outliers using a scatter plot?

A scatter plot (also called a scatter chart or scatter graph) uses dots to represent values of two different numeric variables. The position of each dot along the horizontal and vertical axes gives the two values of an individual data point, so the overall pattern of the dots shows the direction, form, and strength of the bivariate relationship.

A scatter plot can also help find outliers, because they appear as points lying far from the main cluster of the data.
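
What a scatter plot shows visually can be summarized numerically with the Pearson correlation coefficient. The sketch below is a plain-Python companion to the plot rather than a replacement for it; the function name `pearson_r` and the data are assumptions for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: +1 is a perfect rising line, -1 a perfect falling line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


hours = [1, 2, 3, 4, 5]    # hypothetical hours studied
score = [2, 4, 6, 8, 10]   # hypothetical test scores

r_clean = pearson_r(hours, score)
# Add one point far from the trend, as an outlier would appear on the plot.
r_with_outlier = pearson_r(hours + [6], score + [0])

print(round(r_clean, 3), round(r_with_outlier, 3))
```

A single point far from the trend is enough to pull the correlation from 1 down to roughly 0.14, which is why checking the plot for outliers before trusting a summary number is worthwhile.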

### 10. Describe how cross-tabs can be used to figure out how two variables are related.

Cross tabulation, also known as a contingency table or cross-tab, is a method for quantitatively analyzing the relationship between two or more variables, usually categorical ones. It groups the observations by the categories of each variable and counts how many fall into each combination. Comparing the counts or percentages across rows and columns shows whether the variables are associated and how that association changes from one group to another.
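
A cross-tab is essentially a table of counts for every combination of two variables. The sketch below builds one with the standard library; the records (subscription plan, churned or not) and the function name `crosstab` are hypothetical.

```python
from collections import Counter

# Hypothetical records: (subscription plan, did the customer churn?)
records = [("basic", "yes"), ("basic", "no"), ("premium", "no"),
           ("premium", "no"), ("basic", "yes"), ("premium", "yes")]


def crosstab(pairs):
    """Count how often each (row category, column category) combination occurs."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    return {r: {c: counts[(r, c)] for c in cols} for r in rows}


table = crosstab(records)
for plan, row in table.items():
    total = sum(row.values())
    churn_rate = row["yes"] / total
    print(plan, row, f"churn rate = {churn_rate:.0%}")
```

Comparing the row percentages rather than the raw counts is what reveals the relationship: in this made-up sample, basic customers churn about twice as often as premium ones.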