**1. What are the key tasks involved in getting ready to work with machine learning modeling?**

**Ans:** Key tasks in preparing for machine learning modeling are:
1. Problem Definition
2. Data Collection
3. Data Preprocessing
4. Feature Engineering
5. Splitting train-test data
6. Model Selection
7. Model Training
8. Model Evaluation 
9. Deployment


**2. What are the different forms of data used in machine learning? Give a specific example for each of them.**

**Ans:** There are various forms of data used in machine learning, each serving different purposes. Here's an example for each type:

1. **Structured Data:**    
    - Structured data is organized into rows and columns, typically stored in databases or spreadsheets.
    - Example: A sales dataset with columns for customer names, purchase dates, and amounts.
2. **Unstructured Data:**    
    - Unstructured data lacks a predefined structure and includes text, images, audio, and video.
    - Example: Social media posts or customer reviews where information isn't organized in a tabular format.
3. **Semi-Structured Data:**    
    - Semi-structured data is organized loosely and often contains tags or labels.
    - Example: JSON or XML files containing product data, where attributes are labeled but not in a rigid structure.
4. **Time Series Data:**    
    - Time series data consists of sequential measurements over time.
    - Example: Stock price data recorded daily over a year, showing how prices change over time.
5. **Categorical Data:**    
    - Categorical data represents different categories or labels.
    - Example: Colors (red, green, blue) or types of products (electronics, clothing, books).
6. **Numerical Data:**    
    - Numerical data consists of continuous or discrete numeric values.
    - Example: Temperatures, heights, ages, or any measurable quantities.
7. **Text Data:**    
    - Text data consists of strings of characters.
    - Example: News articles, emails, and tweets used for sentiment analysis or text classification.
8. **Image Data:**    
    - Image data consists of pixel values representing visual content.
    - Example: Photos, medical images, or satellite images used in image recognition tasks.
9. **Audio Data:**    
    - Audio data represents sound signals.
    - Example: Speech recordings used in speech recognition or music audio for genre classification.
10. **Video Data:**    
    - Video data combines sequential images and audio.
    - Example: Surveillance footage or video streams for action recognition.

**3. Distinguish:**

**1. Numeric vs. categorical attributes**

|Aspect|Numeric Attributes|Categorical Attributes|
|---|---|---|
|Type of Data|Continuous or discrete numeric values|Discrete categories or labels|
|Examples|Age, temperature, height|Colors, product categories, city names|
|Operations|Arithmetic operations (e.g., addition)|Counting occurrences, mode calculation|
|Preprocessing|Scaling, normalization|One-hot encoding, label encoding|
|Machine Learning|Usually usable directly as features (often after scaling)|Need conversion to numerical representations|
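
The preprocessing row of the table can be illustrated with a minimal sketch in plain Python. The helper names `min_max_scale` and `one_hot_encode` and the sample values are hypothetical, chosen only for illustration; in practice a library routine would normally be used instead.

```python
def min_max_scale(values):
    """Rescale numeric values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot_encode(labels):
    """Map each label to a 0/1 vector with one slot per distinct category."""
    categories = sorted(set(labels))
    vectors = [[1 if label == c else 0 for c in categories] for label in labels]
    return categories, vectors


ages = [20, 30, 40]               # numeric attribute (made-up values)
colors = ["red", "green", "red"]  # categorical attribute (made-up values)

scaled_ages = min_max_scale(ages)
categories, encoded_colors = one_hot_encode(colors)

print(scaled_ages)     # [0.0, 0.5, 1.0]
print(categories)      # ['green', 'red']
print(encoded_colors)  # [[0, 1], [1, 0], [0, 1]]
```

Arithmetic is meaningful on the scaled ages, whereas the colors only become usable by a model once each category has its own 0/1 column.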

**2. Feature selection vs. dimensionality reduction**

|Aspect|Feature Selection|Dimensionality Reduction|
|---|---|---|
|Goal|Select relevant features|Reduce feature space while retaining info|
|Purpose|Enhance model performance, reduce noise|Address curse of dimensionality, speed up learning|
|Approach|Subset selection, ranking, importance|Principal Component Analysis (PCA), t-SNE|
|Feature Subset|Subset of original features|May involve creating new transformed features|
|Retention of Info|Focus on retaining most relevant features|Retains essential information across features|
|Application|When the original features must stay interpretable or only some of them are relevant|When dealing with high-dimensional, strongly correlated data|

**4. Make quick notes on any two of the following:**

**1. The histogram**

- A histogram is a graphical representation of the distribution of numeric data.
- It displays the frequency of data values within specified bins or intervals.
- Useful for understanding data's central tendency, spread, and skewness.
- Helps identify patterns, outliers, and potential data issues.

**2. The scatter plot**

- A scatter plot displays individual data points on a two-dimensional plane.
- Useful for visualizing relationships between two continuous variables.
- Shows patterns, clusters, correlations, or outliers in data.
- Provides insights into trends, dependencies, and potential associations.


**3. PCA (Principal Component Analysis)**

- PCA is a dimensionality reduction technique.
- Reduces the number of dimensions while preserving as much of the data's variance as possible.
- Used to transform correlated variables into uncorrelated principal components.
- Applications include feature reduction, noise reduction, and data visualization.
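
To make the PCA notes concrete, here is a minimal sketch restricted to two features, so that the eigenvalues of the 2x2 covariance matrix can be written in closed form. The function name `pca_2d` and the sample points are assumptions made for illustration; it expects at least two points that are not all identical, and real projects would normally rely on a library implementation.

```python
import math


def pca_2d(points):
    """Return (first principal axis, explained variance ratio, projections)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the 2x2 sample covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)

    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    largest, smallest = half_trace + radius, half_trace - radius

    # Eigenvector belonging to the largest eigenvalue, scaled to unit length.
    if b != 0:
        vx, vy = b, largest - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    axis = (vx / norm, vy / norm)

    # Share of the total variance captured by the first component.
    ratio = largest / (largest + smallest)

    # One-dimensional coordinates of each point along the first component.
    projections = [x * axis[0] + y * axis[1] for x, y in centered]
    return axis, ratio, projections


# Made-up, perfectly correlated data: every point lies on the line y = 2x.
points = [(1, 2), (2, 4), (3, 6), (4, 8)]
axis, ratio, projections = pca_2d(points)
print(axis)   # direction of greatest variance, parallel to (1, 2)
print(ratio)  # close to 1.0: one component keeps nearly all the variance
```

Because the two features are perfectly correlated here, a single principal component is enough to describe the data, which is the situation in which dimensionality reduction loses almost no information.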

**5. Why is it necessary to investigate data? Is there a discrepancy in how qualitative and quantitative data are explored?**

**Ans:** Investigating data is essential to understand its characteristics, patterns, and potential issues before applying machine learning or statistical analysis. Data exploration helps verify data quality, guides feature selection, exposes bias, reveals patterns, and informs the choice of model.

- **Qualitative Data:** Explored for frequency, mode, and relationships using techniques like bar plots, pie charts, and cross-tabulations.    
- **Quantitative Data:** Explored for central tendency, spread, and distribution using histograms, box plots, and scatter plots.

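Both types benefit from assessing missing values and checking how variables relate to one another. Outlier detection applies mainly to quantitative data, although very rare categories play a similar role in qualitative data.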

**6. What are the various histogram shapes? What exactly are 'bins'?**

**Ans:** Histograms can take various shapes, each providing insights into the distribution of data. The shapes include:

1. **Normal Distribution (Bell Curve):** Symmetric with data clustered around the mean, common in nature.
2. **Skewed Right (Positively Skewed):** Tail extends towards the right, few large values on the right.
3. **Skewed Left (Negatively Skewed):** Tail extends towards the left, few small values on the left.
4. **Bimodal:** Two distinct peaks, indicates presence of two groups.
5. **Multimodal:** Multiple peaks, suggests multiple groups or processes.
6. **Uniform:** Similar frequencies across values, no distinct trend.

**Bins:** Bins are the intervals into which the data range is divided in a histogram. Each bin covers a range of values, and the number of data points falling within that range is shown on the vertical axis. Choosing an appropriate number of bins is crucial: too few oversimplify the distribution and can hide features such as multiple peaks, while too many produce a noisy, fragmented picture. Well-chosen bins make the distribution's shape, central tendency, and spread easier to see.
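
The effect of the bin count can be sketched in a few lines of plain Python. The function `histogram_counts` and the sample values are hypothetical; the sketch assumes equal-width bins and places the maximum value in the last bin.

```python
def histogram_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    if width == 0:
        # All values identical: everything lands in the first bin.
        return [len(values)] + [0] * (num_bins - 1)
    counts = [0] * num_bins
    for v in values:
        # The maximum value belongs to the last bin rather than a new one.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts


values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]  # made-up sample with one large value

four_bins = histogram_counts(values, 4)
two_bins = histogram_counts(values, 2)

print(four_bins)  # [3, 5, 1, 1]
print(two_bins)   # [8, 2]

# A crude text histogram: one '#' per observation in each bin.
for count in four_bins:
    print("#" * count)
```

With four bins the peak around 3 to 4 and the long right tail are visible; with two bins the same data collapses into "mostly small, a few large".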

**7. How do we deal with data outliers?**

**Ans:** Depending on the characteristics of the data, outliers can be handled in several ways. The most common approaches are:

- **Remove outliers:**  
    In some cases, it may be appropriate to simply remove the observations that contain outliers. This is particularly useful if you have a large number of observations and the outliers are not representative of the underlying population.
- **Transform outliers:**  
    The impact of outliers can be reduced or eliminated by transforming the feature. For example, a log transformation of a feature can reduce the skewness in the data, reducing the impact of outliers.
- **Impute outliers:**  
    In this case, outliers are simply considered as missing values. You can employ various imputation techniques for missing values, such as mean, median, mode, nearest neighbor, etc., to impute the values for outliers.
- **Use robust statistical methods:**  
    Some statistical methods are less sensitive to outliers and give more reliable results when outliers are present. For example, the median and the interquartile range (IQR) are far less affected by extreme values than the mean and the standard deviation, so using them limits the impact of outliers on the analysis.
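
As a minimal sketch of the removal and robust-statistics approaches above, the common 1.5 × IQR rule can be used to flag outliers. The function name `iqr_outliers`, the multiplier `k=1.5`, and the sample values are assumptions for illustration, not something prescribed by the answer above.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


values = [10, 12, 11, 13, 12, 11, 95]  # made-up data with one extreme value

outliers = iqr_outliers(values)
cleaned = [v for v in values if v not in outliers]

print(outliers)                     # [95]
print(statistics.mean(values))      # pulled upward by the extreme value
print(statistics.median(values))    # 12, barely affected by it
print(statistics.mean(cleaned))     # 11.5 once the outlier is removed
```

Whether a flagged value should actually be removed, transformed, or imputed still depends on whether it is an error or a genuine extreme observation.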

**8. What are the various central inclination measures? Why does mean vary too much from median in certain data sets?**

**Ans:** Central inclination measures, also known as measures of central tendency, help describe the center or average of a dataset. The main central inclination measures are:

1. **Mean (Average):**    
    - The sum of all values divided by the total number of values.
    - Sensitive to outliers, as extreme values can significantly affect it.
2. **Median:**    
    - The middle value when data is ordered.
    - Less affected by outliers, making it a robust measure of central tendency.
3. **Mode:**    
    - The value that appears most frequently in the dataset.
    - Can be used for both qualitative and quantitative data.

The mean is sensitive to outliers and skewed data, while the median is robust to these influences. Therefore, the mean can vary significantly from the median in situations where extreme values or skewed distributions exist.
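
A small sketch using Python's standard `statistics` module shows the gap between mean and median on skewed data. The income and color values are made up for illustration.

```python
import statistics

# Made-up incomes (in thousands): five typical values and one extreme one.
incomes = [30, 32, 35, 38, 40, 1000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

print(mean_income)    # dragged far above the typical values by 1000
print(median_income)  # 36.5, still representative of the bulk of the data

# The mode also works on qualitative data.
favorite_colors = ["red", "blue", "red", "green"]
most_common_color = statistics.mode(favorite_colors)
print(most_common_color)  # red
```

A single extreme value moves the mean well away from every typical observation, while the median only depends on the middle of the ordered data.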

**9. Describe how a scatter plot can be used to investigate bivariate relationships. Is it possible to find outliers using a scatter plot?**

**Ans:** A scatter plot is a visualization tool for investigating the relationship between two variables. It plots each observation as a dot on a two-dimensional plane, where each axis represents one of the variables. Scatter plots are particularly useful for spotting patterns, correlations, and potential outliers in data.

**Using a Scatter Plot to Investigate Bivariate Relationships:**

- **Patterns:** Scatter plots help identify patterns such as linear relationships, clusters, or curvilinear trends between two variables.
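- **Correlations:** The arrangement of points reveals the strength and direction of the correlation between the variables. With a positive correlation the points trend upward from left to right (both variables increase together); with a negative correlation they trend downward (one variable decreases as the other increases). The more tightly the points hug a line, the stronger the correlation.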
- **Outliers:** Outliers, which are data points significantly deviating from the general trend, are often visible in scatter plots. They can indicate data errors or provide valuable insights.

**Finding Outliers Using a Scatter Plot:**

- **Outlier Detection:** Scatter plots make outliers easier to spot as they stand out from the main cluster of points.
- **Visual Inspection:** By observing data points that deviate from the general trend, you can identify potential outliers.
- **Context Matters:** Consider whether outliers are due to data errors, measurement inaccuracies, or genuine extreme values. Consult domain experts to validate their significance.
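
What a scatter plot shows visually can also be checked numerically. The sketch below computes the Pearson correlation coefficient in plain Python and shows how one outlying point changes it; the function name `pearson_r` and the hours/scores data are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up data: study hours versus exam score, rising almost linearly.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]
r_clean = pearson_r(hours, scores)

# The same data plus one point far below the trend (6 hours, score 20).
r_with_outlier = pearson_r(hours + [6], scores + [20])

print(r_clean)         # strong positive correlation, close to 1
print(r_with_outlier)  # a single outlier is enough to flip the sign here
```

On a scatter plot the extra point would sit far below the upward trend, which is why visual inspection is a useful check before trusting a correlation coefficient.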

**10. Describe how cross-tabs can be used to figure out how two variables are related.**

**Ans:** Cross-tabulation (cross-tab) is used to explore how two categorical variables are related. It involves creating a table that counts the occurrences of combinations of categories from both variables. By analyzing patterns in the table, you can determine if there's an association between the variables. This technique helps identify trends, dependencies, and connections within categorical data. For example, cross-tabs can reveal if certain genders tend to prefer specific car types.
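
A cross-tab is simply a count of category pairs, so it can be sketched with `collections.Counter` from the standard library. The survey responses below are made up, and the helper name `cross_tab` is hypothetical.

```python
from collections import Counter


def cross_tab(row_values, col_values):
    """Count how often each (row category, column category) pair occurs."""
    return Counter(zip(row_values, col_values))


# Made-up survey responses: one entry per respondent.
genders = ["F", "F", "M", "M", "F", "M"]
car_types = ["SUV", "Sedan", "SUV", "SUV", "Sedan", "SUV"]

table = cross_tab(genders, car_types)

# Print the counts as a small table, one row per gender.
columns = sorted(set(car_types))
print("gender", *columns)
for gender in sorted(set(genders)):
    print(gender, *(table[(gender, car)] for car in columns))
```

In this toy sample every "M" respondent chose an SUV while most "F" respondents chose a sedan, which is the kind of pattern that suggests an association worth testing on real data.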