# **Project Name**    - House sale data



##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Team
##### **Team Member 1 -** Anubhav Bansal
##### **Team Member 2 -** Ansh Puri
##### **Team Member 3 -** Rohan Lakhanpal
##### **Team Member 4 -** Rishit

# **Project Summary -**

Predicting the sales price of houses is a critical task within the real estate market, directly impacting buyers, sellers, and industry professionals. This project aims to develop a predictive model capable of estimating the sale price for houses given a set of features or characteristics. The prediction will be based on a test set where each house is identified by an Id, and the goal is to forecast the value of the SalePrice variable accurately.

Predicting the sale price of houses is a complex but valuable task that can significantly impact the real estate market. Through careful data preprocessing, thoughtful feature engineering, strategic model selection, and rigorous validation and evaluation, it is possible to develop a predictive model that can accurately estimate the sales price of houses, thereby aiding stakeholders in making informed decisions.



# **GitHub Link -**

https://github.com/anubhav9087/AIML

It is your job to predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable.


**Problem Statement:** The goal is to predict the sale price of each house in the test set, identified by its unique Id, so that stakeholders in the real estate market can make informed decisions.






# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# For data manipulation and analysis
import pandas as pd
import numpy as np

# For data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Set seaborn style for plots
sns.set()


### Dataset Loading

In [None]:
# Load the dataset
dataset_path = '/content/house_sale_data.csv'
df = pd.read_csv(dataset_path)

# Display the first few rows of the dataframe
print(df.head())





### Dataset First View

In [None]:
print(df.head())

# Print the size of the dataset (rows, columns)
print(f"\nThe dataset contains {df.shape[0]} rows and {df.shape[1]} columns.")




### Dataset Rows & Columns count

In [None]:
print(f"The dataset contains {df.shape[0]} rows and {df.shape[1]} columns.")








### Dataset Information

In [None]:
print(df.info())


#### Duplicate Values

In [None]:
duplicate_count = df.duplicated().sum()
print(f"There are {duplicate_count} duplicate rows in the dataset.")

#### Missing Values/Null Values

In [None]:
missing_values_count = df.isnull().sum()

print("Missing Values Count:")
print(missing_values_count)

In [None]:
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cmap='viridis', cbar=False, yticklabels=False)
plt.title('Missing Values in the Dataset')
plt.show()





Size of the Dataset: The row and column counts are reported by `df.shape` in the cells above.

Data Types: We have information about the data types of each column in the dataset, which includes numerical, categorical, and potentially datetime data types.

Missing Values: We identified the presence of missing values in certain columns of the dataset. These missing values need to be handled appropriately during data preprocessing.

Duplicate Values: We checked for duplicate rows with `df.duplicated().sum()`; the count is printed by the Duplicate Values cell above.

Visualizing Missing Values: We visualized the missing values using a heatmap, which provides a clear overview of the missing data distribution across different columns.
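The heatmap shows where values are missing, but not how much of each column is affected. The sketch below is one way to put numbers on it. The helper name `missing_value_summary` and the small demo frame are hypothetical and only illustrate the idea; the real dataset would be passed in instead.

```python
import pandas as pd


def missing_value_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })
    # Keep only the columns that actually have gaps
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_count", ascending=False)


# Tiny illustrative frame (hypothetical values, not the real dataset)
demo = pd.DataFrame({
    "LotFrontage": [65.0, None, 68.0, None],
    "Alley": [None, None, None, "Pave"],
    "SalePrice": [208500, 181500, 223500, 140000],
})
summary = missing_value_summary(demo)
print(summary)
```

Calling `missing_value_summary(df)` on the loaded dataset would list only the columns with gaps, which helps decide between dropping a column and filling it.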


## ***2. Understanding Your Variables***

In [None]:
columns_list = df.columns.tolist()

print("Columns in the Dataset:")
print(columns_list)





In [None]:
dataset_description = df.describe()

print("Statistical Summary of the Dataset:")
print(dataset_description)

variables_description = {
    'Variable1': 'Description of Variable1',
    'Variable2': 'Description of Variable2',
    # Add descriptions for all variables in your dataset
}

# Print variable descriptions
print("Variables Description:")
for variable, description in variables_description.items():
    print(f"{variable}: {description}")


### Check Unique Values for each variable.

In [None]:
unique_values = df.nunique()

print("Unique Values for each Variable:")
print(unique_values)





## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
print("First 5 rows of the dataset:")
print(df.head())

# Check for duplicate rows and remove them
print("\nRemoving duplicate rows...")
df.drop_duplicates(inplace=True)
print("Duplicate rows removed.")

# Check for missing values and handle them
print("\nHandling missing values...")
print("Number of missing values in each column:")
print(df.isnull().sum())

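# Handle missing values column by column. Dropping every row that has any
# missing value (df.dropna()) can discard most or all rows when some columns
# are sparsely populated.
# Drop columns that are more than half empty (the 50% cutoff is a judgment call)
df = df.loc[:, df.isnull().mean() <= 0.5].copy()
# Fill remaining gaps: median for numeric columns, most frequent value otherwise
for col in df.columns[df.isnull().any()]:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna(df[col].mode().iloc[0])
print("Missing values handled.")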

# Check the data types of each column
print("\nData types of each column:")
print(df.dtypes)

# Convert data types if needed (e.g., using df.astype() or pd.to_datetime())

# Perform additional data wrangling steps as needed (e.g., renaming columns, reordering columns)

# Save the cleaned dataset
df.to_csv('cleaned_data.csv', index=False)
print("\nCleaned dataset saved as 'cleaned_data.csv'.")

The wrangling step gives us:

*   The number of duplicate rows, if any, in the dataset.
*   The distribution of missing values across different columns, which helps in understanding the completeness of the data.
*   The initial data types of each column, which may indicate potential inconsistencies or the need for conversions.
*   The cleaned dataset, ready for further analysis, so that only high-quality data is used for decision-making or modeling.
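Each cleaning step can remove rows, so it may help to record the row count after every step and check that the cleaned dataset is not unexpectedly small. The following is a minimal sketch; `apply_steps_with_report` and the demo frame are hypothetical, and the two steps shown should be swapped for whatever strategy the notebook actually applies.

```python
import pandas as pd


def apply_steps_with_report(df, steps):
    """Apply named cleaning steps in order and record the row count after each."""
    report = [("start", len(df))]
    for name, step in steps:
        df = step(df)
        report.append((name, len(df)))
    return df, report


# Hypothetical four-row frame: one exact duplicate and one missing value
demo = pd.DataFrame({
    "Id": [1, 2, 2, 3],
    "LotArea": [8450, 9600, 9600, None],
    "SalePrice": [208500, 181500, 181500, 223500],
})

steps = [
    ("drop_duplicates", lambda d: d.drop_duplicates()),
    ("dropna", lambda d: d.dropna()),
]
cleaned, report = apply_steps_with_report(demo, steps)
for name, rows in report:
    print(f"{name}: {rows} rows")
```

If a step removes far more rows than expected, that is a signal to revisit it before saving the cleaned file.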






## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('/content/cleaned_data.csv')

# Plot a histogram of the 'SalePrice' variable
plt.figure(figsize=(10, 6))
plt.hist(df['SalePrice'], bins=30, color='skyblue', edgecolor='black')
plt.title('Histogram of Sale Price')
plt.xlabel('Sale Price')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

In this case, understanding the distribution of sale prices can provide valuable insights into the range, central tendency, and spread of prices in the dataset. This information is essential for various analytical tasks, such as identifying outliers, understanding the typical price range, and assessing the skewness or symmetry of the distribution.

Histograms also allow for easy comparison of frequency counts across different price ranges, making them particularly useful for exploring the distribution of continuous variables like sale prices. Overall, a histogram is a suitable choice for gaining a quick understanding of the distribution of sale prices in the dataset.

##### 2. What is/are the insight(s) found from the chart?




Distribution Shape: The shape of the histogram can provide insights into the overall distribution of sale prices. For example, a symmetric distribution indicates that sale prices are evenly distributed around the mean, while a skewed distribution suggests that sale prices are more concentrated towards one end of the range.

Central Tendency: The central tendency of sale prices can be inferred from the peak or mode of the histogram. The mode represents the most frequently occurring sale price, providing an indication of the typical price range in the dataset.

Range of Sale Prices: The range of sale prices covered by the dataset can be observed from the horizontal axis of the histogram. This helps in understanding the variability and diversity of sale prices in the dataset.

Outliers: Any outliers or unusual observations in the distribution of sale prices can be identified visually as data points that lie far from the main bulk of the distribution.

Data Sparsity: Gaps or sparse regions in the histogram may indicate areas of the sale price range where there are fewer observations, highlighting potential areas for further investigation or data collection.

Overall, the histogram of sale prices provides a comprehensive overview of the distributional characteristics of sale prices in the dataset, aiding in understanding the underlying patterns and variability within the data.
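The properties read off the histogram (central tendency, range, skewness, outliers) can also be computed directly. The sketch below assumes the common 1.5 × IQR rule for flagging outliers, which is a convention and not something the notebook prescribes. The helper name `describe_distribution` is hypothetical, and the ten prices are the sample values used in the Chart 8 and Chart 11 cells of this notebook.

```python
import pandas as pd


def describe_distribution(prices: pd.Series) -> dict:
    """Summarize center, range, skewness, and IQR-based outliers of a series."""
    q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
    iqr = q3 - q1
    # Values beyond 1.5 * IQR from the quartiles are flagged as outliers
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = prices[(prices < lower) | (prices > upper)]
    return {
        "mean": prices.mean(),
        "median": prices.median(),
        "min": prices.min(),
        "max": prices.max(),
        "skew": prices.skew(),  # > 0 means a longer right tail
        "n_outliers": len(outliers),
    }


sample_prices = pd.Series(
    [208500, 181500, 223500, 140000, 250000, 143000, 307000, 200000, 129900, 118000]
)
stats = describe_distribution(sample_prices)
for key, value in stats.items():
    print(f"{key}: {value}")
```

On the full dataset, `describe_distribution(df['SalePrice'])` would give the same summary for all houses.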

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impacts:

Understanding Customer Preferences: By analyzing the distribution of sale prices, businesses can gain insights into customer preferences and purchasing behavior. This understanding can help in tailoring marketing strategies, product offerings, and pricing strategies to better meet customer needs, leading to increased sales and customer satisfaction.

Identifying Market Trends: Analysis of sale price distribution can reveal trends in the real estate market, such as increasing demand for certain types of properties or fluctuations in property values over time. This information can inform strategic decisions related to investment opportunities, market positioning, and portfolio management.

Optimizing Pricing Strategies: Businesses can use insights from the sale price distribution to optimize pricing strategies, such as setting competitive prices, offering discounts or promotions, and adjusting pricing tiers based on market demand and customer preferences. This can lead to improved revenue generation and profitability.

Negative Impacts:

Risk of Overpricing or Underpricing: Failure to accurately assess market trends and customer preferences based on the sale price distribution may result in overpricing or underpricing of properties. Overpricing can lead to decreased demand and longer listing times, while underpricing can result in lost revenue and reduced profitability.

Missed Opportunities: Failure to identify outliers or niche market segments within the sale price distribution may result in missed opportunities for business growth. For example, overlooking high-demand properties or undervalued market segments could lead to lost sales and market share.

Customer Dissatisfaction: Inaccurate pricing strategies based on incomplete or misinterpreted insights from the sale price distribution may lead to customer dissatisfaction. This can result in negative reviews, reduced customer loyalty, and ultimately, decreased sales and revenue.

In conclusion, while insights gained from analyzing the sale price distribution can lead to positive business impacts such as improved customer targeting and pricing optimization, failure to interpret the data accurately or respond effectively to market trends may result in negative growth outcomes. It is essential for businesses to leverage data-driven insights responsibly and adapt their strategies accordingly to maximize positive impacts and mitigate potential risks.






#### Chart - 2

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('/content/cleaned_data.csv')

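# Select up to five low-cardinality columns for visualization.
# Bar charts of value counts are only readable for categorical-like columns,
# so skip 'Id' and any column with many distinct values (25 is an arbitrary cutoff).
columns_to_visualize = [
    col for col in df.columns
    if col != 'Id' and df[col].nunique() <= 25
][:5]

# Create a separate bar chart for each column
for column in columns_to_visualize:
    plt.figure(figsize=(12, 6))
    column_counts = df[column].value_counts()
    column_counts.plot(kind='bar', color='skyblue')
    plt.title('Distribution of Houses in ' + column)
    plt.xlabel(column)
    plt.ylabel('Number of Houses')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()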


##### 1. Why did you pick the specific chart?

I selected a bar chart because it is effective for visualizing the distribution of categorical data, which is common in datasets like the one provided. In this case, we are interested in understanding the distribution of houses across different categories or groups within the dataset.

By using a bar chart, we can easily compare the frequency or count of houses in each category, providing a clear visual representation of the distribution. This helps in identifying patterns, trends, and disparities within the data, which is crucial for gaining insights and making data-driven decisions.

Additionally, a bar chart allows for easy customization, such as adjusting the color scheme, orientation of labels, and other visual elements, to enhance readability and convey information effectively. Overall, a bar chart is a suitable choice for visualizing categorical data distributions and is commonly used for exploratory data analysis.






##### 2. What is/are the insight(s) found from the chart?

From the bar chart, we can gain several insights about the distribution of houses in the dataset:

1. **Variety of Categories**: We can observe the different categories present in each column of the dataset. For example, in the 'MSSubClass' column, we see various numerical categories representing different types of dwellings (e.g., 20 for 1-story 1946 & newer all styles).

2. **Frequency of Each Category**: The height of each bar represents the frequency or count of houses belonging to each category. By comparing the heights of the bars, we can identify which categories are more prevalent and which are less common.

3. **Skewness or Imbalance**: Disparities in the heights of the bars indicate skewness or imbalance in the distribution of houses across categories. For instance, if one category has a significantly larger count compared to others, it suggests that the dataset is skewed towards that category.

4. **Identification of Outliers**: Unusually tall bars or categories with very low counts may indicate outliers or rare occurrences within the dataset. These outliers could be further investigated to understand their significance or potential impact on analysis.

5. **Potential Patterns or Trends**: Patterns or trends may emerge from the distribution of houses across categories. For instance, certain neighborhoods may have a higher concentration of houses, certain building types may be more common, or certain zoning classifications may dominate the dataset.

Overall, the bar chart provides a visual summary of the distribution of houses across different categories, allowing us to identify patterns, trends, and potential areas for further analysis.
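The category frequencies and imbalance described above can be tabulated alongside the chart. This is a rough sketch: `category_shares` is a hypothetical helper, and the threshold below which a category counts as rare is an assumption chosen for illustration. The demo uses the ten `MSZoning` values from the Chart 8 cell of this notebook.

```python
import pandas as pd


def category_shares(column: pd.Series, rare_threshold: float = 0.05) -> pd.DataFrame:
    """Return count, share, and a rare-category flag for each distinct value."""
    counts = column.value_counts()
    shares = counts / counts.sum()
    return pd.DataFrame({
        "count": counts,
        "share": shares.round(3),
        "is_rare": shares < rare_threshold,
    })


zoning = pd.Series(['RL', 'RL', 'RL', 'RL', 'RL', 'RL', 'RL', 'RL', 'RM', 'RL'])
# A 15% threshold is used here only so the small sample shows a rare category
result = category_shares(zoning, rare_threshold=0.15)
print(result)
```

A table like this makes it easy to spot categories with too few houses to support reliable conclusions.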

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

In conclusion, while the insights gained from analyzing the distribution of houses can contribute to positive business impact by informing decision-making and enabling targeted strategies, businesses must also be mindful of potential pitfalls and risks associated with narrow focus, failure to adapt, and misallocation of resources. It is essential to strike a balance between leveraging insights for growth and mitigating potential negative impacts through strategic planning and proactive management.

#### Chart - 3

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('/content/cleaned_data.csv')

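# Plot a histogram for 'OverallQual'.
# Bin edges 1..11 with align='left' give each rating from 1 to 10 its own bar;
# with edges 1..10 the last bin would merge ratings 9 and 10.
plt.figure(figsize=(10, 6))
plt.hist(df['OverallQual'], bins=range(1, 12), align='left', color='skyblue', edgecolor='black', alpha=0.7)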
plt.title('Distribution of Houses by Overall Quality')
plt.xlabel('Overall Quality')
plt.ylabel('Number of Houses')
plt.xticks(range(1, 11))
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I chose a histogram because it shows the frequency of observations at each level of a variable, such as the overall quality of houses in this case. Histograms are particularly suitable for displaying the count of observations within different intervals or bins of the variable.

For this dataset, the 'OverallQual' column is an ordinal variable that rates the overall quality of each house on an integer scale from 1 to 10. By using a histogram, we can visualize how the houses are distributed across different quality ratings, providing insights into the prevalence of certain quality levels within the dataset.

##### 2. What is/are the insight(s) found from the chart?

Overall, the histogram provides a visual summary of the distribution of houses based on their overall quality ratings, offering insights into the composition of the dataset and potential trends within the real estate market represented by the data.
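The counts behind the bars can be printed as a table, including ratings that no house received. This is a small sketch; `rating_counts` is a hypothetical helper and the demo ratings are made up for illustration.

```python
import pandas as pd


def rating_counts(ratings: pd.Series, low: int = 1, high: int = 10) -> pd.Series:
    """Count houses per rating, keeping every level from low to high in order."""
    counts = ratings.value_counts()
    # reindex adds ratings that never occur, with a count of zero
    return counts.reindex(range(low, high + 1), fill_value=0)


# Hypothetical ratings, not taken from the real dataset
demo_ratings = pd.Series([7, 6, 7, 8, 5, 7, 10])
counts = rating_counts(demo_ratings)
print(counts)
```

On the real data this would be `rating_counts(df['OverallQual'])`.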






##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

In conclusion, while the insights gained from the histogram can facilitate positive business outcomes by informing decision-making and strategy development, businesses must also be cautious of potential pitfalls such as overlooking niche markets, ignoring market trends, and misinterpreting quality perceptions. Striking a balance between leveraging quality distribution insights and adapting to market dynamics is essential for sustained growth and competitiveness in the real estate industry.






#### Chart - 4

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('/content/cleaned_data.csv')

# Plot a scatter plot for 'OverallQual' vs 'SalePrice'
plt.figure(figsize=(10, 6))
plt.scatter(df['OverallQual'], df['SalePrice'], color='blue', alpha=0.5)
plt.title('Overall Quality vs Sale Price')
plt.xlabel('Overall Quality')
plt.ylabel('Sale Price')
plt.grid(True)
plt.tight_layout()
plt.show()





##### 1. Why did you pick the specific chart?

I chose a scatter plot because it is an effective way to visualize the relationship between two numeric variables, such as 'OverallQual' (overall quality, an ordinal rating from 1 to 10) and 'SalePrice' (sale price) in this case. Scatter plots allow us to observe patterns, trends, and correlations between the variables by displaying individual data points on a Cartesian plane. Because 'OverallQual' takes only integer values, the points form vertical bands, one per rating. This type of chart is particularly useful for exploring the association between variables and identifying any potential linear or nonlinear relationships between them.







##### 2. What is/are the insight(s) found from the chart?

Overall, the scatter plot provides valuable insights into the relationship between overall quality and sale prices of houses in the dataset, aiding in understanding the factors influencing property values and informing pricing strategies in the real estate market.
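To put a number on the relationship, one option is the median sale price per quality level together with a rank correlation, which suits an ordinal rating. The sketch below computes Spearman's coefficient as the Pearson correlation of ranks, so it needs only pandas. `quality_price_summary` is a hypothetical helper and the demo values are made up.

```python
import pandas as pd


def quality_price_summary(df: pd.DataFrame):
    """Return median SalePrice per OverallQual level and the rank correlation."""
    medians = df.groupby("OverallQual")["SalePrice"].median()
    # Spearman correlation = Pearson correlation of the ranked values
    spearman = df["OverallQual"].rank().corr(df["SalePrice"].rank())
    return medians, spearman


# Hypothetical values, not taken from the real dataset
demo = pd.DataFrame({
    "OverallQual": [5, 5, 6, 7, 7, 8],
    "SalePrice": [120000, 130000, 150000, 200000, 210000, 300000],
})
medians, spearman = quality_price_summary(demo)
print(medians)
print(f"Spearman rank correlation: {spearman:.3f}")
```

A coefficient close to 1 would indicate that higher quality ratings consistently go with higher prices.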






##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

In conclusion, while the insights from the scatter plot can inform strategic decision-making and enhance business performance in the real estate industry, it's crucial to consider a holistic approach to pricing and valuation, incorporating various factors beyond just overall quality ratings to ensure sustainable growth and profitability.






#### Chart - 5

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('/content/cleaned_data.csv')

# Plot a histogram for 'SalePrice'
plt.figure(figsize=(10, 6))
plt.hist(df['SalePrice'], bins=30, color='skyblue', edgecolor='black')
plt.title('Distribution of Sale Prices')
plt.xlabel('Sale Price')
plt.ylabel('Frequency')
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I chose a histogram because it is an effective way to visualize the distribution of a continuous variable, such as 'SalePrice' in this case. Histograms provide insights into the central tendency, spread, and shape of the data distribution. They allow us to understand how the sale prices are distributed across different price ranges and identify any patterns or anomalies in the data distribution. This visualization is particularly useful for understanding the overall distribution of sale prices in the dataset and can help in assessing the market trends and pricing strategies in the real estate industry.

##### 2. What is/are the insight(s) found from the chart?

Overall, the histogram provides a comprehensive overview of the distribution of sale prices in the dataset, enabling stakeholders to understand the market landscape, identify pricing trends, and make informed decisions regarding property valuation and pricing strategies.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

In conclusion, while insights from the histogram of sale prices can inform strategic decision-making and enhance business performance, it's essential for businesses to interpret the data accurately and adapt their strategies accordingly to mitigate potential risks and capitalize on growth opportunities in the real estate market.






#### Chart - 6

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('/content/cleaned_data.csv')

# Plot a scatter plot for 'GrLivArea' vs 'SalePrice'
plt.figure(figsize=(10, 6))
plt.scatter(df['GrLivArea'], df['SalePrice'], color='skyblue', alpha=0.6)
plt.title('Relationship between Above Grade Living Area and Sale Price')
plt.xlabel('Above Grade Living Area (sq ft)')
plt.ylabel('Sale Price ($)')
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I chose a scatter plot because it is ideal for visualizing the relationship between two continuous variables. In this case, I want to explore the relationship between the above grade living area ('GrLivArea') and the sale price ('SalePrice') of properties. A scatter plot allows us to observe patterns, trends, and correlations between these variables by plotting each data point according to its corresponding values on the x-axis (living area) and y-axis (sale price). This visualization method helps in understanding how changes in one variable affect the other and provides insights into potential associations between them.

##### 2. What is/are the insight(s) found from the chart?

Overall, the scatter plot provides a visual representation of the relationship between above grade living area and sale price, allowing stakeholders to gain insights into pricing patterns and trends in the real estate market.
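The strength of the relationship can be summarized with a Pearson correlation and a straight-line fit. This is a minimal sketch that assumes a linear trend is a reasonable first approximation; `linear_trend` is a hypothetical helper, and the demo values are made up so that they lie exactly on a line.

```python
import numpy as np
import pandas as pd


def linear_trend(df: pd.DataFrame, x_col: str = "GrLivArea", y_col: str = "SalePrice") -> dict:
    """Return the Pearson correlation and least-squares line for two columns."""
    pearson_r = df[x_col].corr(df[y_col])
    # Degree-1 polyfit returns the slope first, then the intercept
    slope, intercept = np.polyfit(df[x_col], df[y_col], deg=1)
    return {"pearson_r": pearson_r, "slope": slope, "intercept": intercept}


# Hypothetical values on the line SalePrice = 100 * GrLivArea + 20000
demo = pd.DataFrame({
    "GrLivArea": [1000, 1500, 2000, 2500],
    "SalePrice": [120000, 170000, 220000, 270000],
})
trend = linear_trend(demo)
print(trend)
```

The slope can be read as the approximate price change per additional square foot, under the linear assumption.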






##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

In conclusion, while the insights gained from the scatter plot can inform strategic decision-making and enhance business performance in the real estate industry, it's crucial for stakeholders to interpret the data accurately and adopt a balanced approach to pricing and market positioning to mitigate potential risks and foster positive growth.






#### Chart - 7

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('/content/cleaned_data.csv')

# Plot a histogram of 'SalePrice'
plt.figure(figsize=(10, 6))
plt.hist(df['SalePrice'], bins=20, color='skyblue', edgecolor='black')
plt.title('Distribution of Sale Prices')
plt.xlabel('Sale Price ($)')
plt.ylabel('Frequency')
plt.grid(axis='y', alpha=0.5)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I chose to create a histogram to visualize the distribution of sale prices because it provides a clear and concise representation of the frequency of different sale price ranges within the dataset. Histograms are particularly useful for understanding the central tendency, variability, and shape of a continuous variable's distribution, making them well-suited for exploring the distribution of sale prices in real estate data.






##### 2. What is/are the insight(s) found from the chart?

Overall, the histogram provides a comprehensive overview of the distribution of sale prices, allowing stakeholders to gain insights into market dynamics, pricing trends, and potential opportunities or challenges within the real estate market.







##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

In conclusion, while the insights gained from the histogram visualization of sale prices offer valuable opportunities for informed decision-making and strategic planning, businesses must carefully analyze and interpret these insights to mitigate risks and leverage opportunities effectively, ensuring positive growth and competitiveness in the real estate market.





#### Chart - 8

In [None]:
import matplotlib.pyplot as plt

# Sample data
data = {
    'MSZoning': ['RL', 'RL', 'RL', 'RL', 'RL', 'RL', 'RL', 'RL', 'RM', 'RL'],
    'SalePrice': [208500, 181500, 223500, 140000, 250000, 143000, 307000, 200000, 129900, 118000]
}

# Calculate average sale price for each zoning category
average_sale_price = {}
for zone, price in zip(data['MSZoning'], data['SalePrice']):
    if zone in average_sale_price:
        average_sale_price[zone].append(price)
    else:
        average_sale_price[zone] = [price]

for zone, prices in average_sale_price.items():
    average_sale_price[zone] = sum(prices) / len(prices)

# Create bar chart
plt.figure(figsize=(10, 6))
plt.bar(average_sale_price.keys(), average_sale_price.values(), color='skyblue')
plt.xlabel('MSZoning')
plt.ylabel('Average Sale Price')
plt.title('Average Sale Price by MSZoning')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I chose to create a bar chart because it is a suitable visualization for comparing the average sale prices across different categories of zoning (`MSZoning`). Bar charts are effective for displaying and comparing discrete categories or groups of data, making them a good choice for this type of analysis.

##### 2. What is/are the insight(s) found from the chart?

The chart is built from a ten-row sample that contains only two zoning categories: RL (Residential Low Density, nine houses) and RM (Residential Medium Density, one house). In this sample, RL houses average about $196,833, compared with $129,900 for the single RM house.

Other zoning categories, such as FV (Floating Village Residential), RH (Residential High Density), and C (Commercial), do not appear in the sample, so the chart supports no conclusion about them. Comparing all categories requires computing the averages on the full dataset.

These insights give an initial indication of how zoning classification may relate to property prices.
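The same average can be computed for every zoning category with a pandas `groupby` in place of the manual loop. The sketch below is one way to do it; `average_price_by_zone` is a hypothetical helper, and the demo reuses the ten sample rows from the chart cell so it runs on its own.

```python
import pandas as pd


def average_price_by_zone(df: pd.DataFrame) -> pd.Series:
    """Return the mean SalePrice per MSZoning category, highest first."""
    return (
        df.groupby("MSZoning")["SalePrice"]
        .mean()
        .sort_values(ascending=False)
    )


sample = pd.DataFrame({
    "MSZoning": ['RL', 'RL', 'RL', 'RL', 'RL', 'RL', 'RL', 'RL', 'RM', 'RL'],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000, 307000, 200000, 129900, 118000],
})
avg = average_price_by_zone(sample)
print(avg)
```

Passing the full dataset in place of `sample` would cover every zoning category present in the data.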






##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Overall, while the insights provide valuable information for decision-making, it's essential to consider both the opportunities and challenges associated with different zoning categories to maximize returns and mitigate risks effectively.







#### Chart - 9

In [None]:
import pandas as pd

# Load the dataset
data = pd.read_csv('/content/cleaned_data.csv')

# Display the column names
print(data.columns)


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
data = pd.read_csv('/content/cleaned_data.csv')

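# Create a heatmap of correlations between numeric columns only.
# numeric_only=True skips text columns, which would otherwise raise an error
# in recent pandas versions.
plt.figure(figsize=(12, 8))
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt=".2f")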
plt.title('Correlation Heatmap')
plt.show()


##### 1. Why did you pick the specific chart?




I chose to create a heatmap because it's an effective way to visualize the correlation between different numerical variables in a dataset. Heatmaps provide a clear and concise overview of the relationships between variables, making it easier to identify patterns and dependencies. This type of chart is particularly useful for exploratory data analysis and understanding the underlying structure of the data.






##### 2. What is/are the insight(s) found from the chart?

Strength of correlations: Heatmaps show the strength and direction of correlations between pairs of variables. Strong positive correlations (values close to 1) suggest that as one variable increases, the other tends to increase as well. Strong negative correlations (values close to -1) suggest that as one variable increases, the other tends to decrease.

Patterns in correlations: Heatmaps help identify patterns in correlations across multiple variables. For example, you might observe clusters of variables that are highly correlated with each other, indicating potential multicollinearity.

Identifying important variables: Variables with strong correlations to the target variable (e.g., "SalePrice" in this dataset) can be identified, indicating which features may have a significant impact on the target variable.
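With many columns, the heatmap can be hard to read, so it may help to list the features most correlated with the target and the feature pairs that are strongly correlated with each other. This is a sketch under assumptions: `correlation_report` is a hypothetical helper, the 0.8 cutoff for a strong pair is arbitrary, and the demo values are made up.

```python
import numpy as np
import pandas as pd


def correlation_report(df: pd.DataFrame, target: str = "SalePrice", pair_threshold: float = 0.8):
    """Rank features by |correlation| with the target and list collinear pairs."""
    corr = df.corr(numeric_only=True)
    target_corr = corr[target].drop(target).sort_values(key=abs, ascending=False)

    # Look at feature-to-feature correlations, upper triangle only,
    # so each pair is reported once and the diagonal is ignored
    features = corr.drop(index=target, columns=target)
    upper = np.triu(np.ones(features.shape, dtype=bool), k=1)
    pairs = features.where(upper).stack()
    strong_pairs = pairs[pairs.abs() > pair_threshold]
    return target_corr, strong_pairs


# Hypothetical values, not taken from the real dataset
demo = pd.DataFrame({
    "GrLivArea": [1000, 1200, 1500, 1800, 2000, 2400],
    "TotRmsAbvGrd": [5, 6, 7, 8, 9, 10],
    "MoSold": [3, 11, 6, 1, 9, 5],
    "Street": ["Pave"] * 6,
    "SalePrice": [110000, 135000, 160000, 200000, 215000, 260000],
})
target_corr, strong_pairs = correlation_report(demo)
print(target_corr)
print(strong_pairs)
```

Pairs that exceed the threshold are candidates for multicollinearity and may warrant dropping or combining one of the two features.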

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

In summary, while insights from heatmap analysis can generally lead to positive business impacts by informing decision-making and improving predictive models, it's essential to be mindful of potential challenges such as multicollinearity and weak correlations that may require further investigation and refinement of analytical approaches.






#### Chart - 11

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# Creating a DataFrame from the provided information
data = {
    "Id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000, 307000, 200000, 129900, 118000]
}

df = pd.DataFrame(data)

# Plotting the histogram
plt.figure(figsize=(10, 6))
plt.hist(df['SalePrice'], bins=5, color='skyblue', edgecolor='black')
plt.title('Histogram of Sale Prices')
plt.xlabel('Sale Price')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

I chose to create a histogram because it's an effective way to visualize the distribution of a single continuous variable, in this case, the sale prices of houses. Histograms provide insights into the frequency or count of values within different intervals or bins, allowing us to understand the central tendency, spread, and shape of the distribution. This is particularly useful for understanding the range and variability of sale prices in the dataset.






##### 2. What is/are the insight(s) found from the chart?

Central Tendency: The tallest bars show the most common range of sale prices, that is, where the distribution tends to cluster.

Spread: The spread of sale prices across different intervals gives us an idea of the variability in housing prices.

Skewness: The shape of the histogram indicates whether the distribution is symmetric or skewed.

Outliers: Extreme values show up as isolated bars or long tails far from the bulk of the data.

These insights help us understand the overall distribution of sale prices and identify any patterns or anomalies present in the data.
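
The same properties can be checked numerically. The sketch below is only an illustration on the ten sample sale prices used in the chart code; the dictionary name `summary` is an assumption, and the figures for the full dataset will differ.

```python
import pandas as pd

# Ten sample sale prices from the chart above (not the full dataset)
sale_price = pd.Series(
    [208500, 181500, 223500, 140000, 250000,
     143000, 307000, 200000, 129900, 118000],
    name="SalePrice",
)

summary = {
    "mean": sale_price.mean(),      # central tendency
    "median": sale_price.median(),  # central tendency, robust to extremes
    "std": sale_price.std(),        # spread (sample standard deviation)
    "skew": sale_price.skew(),      # > 0 suggests a longer right tail
}
for name, value in summary.items():
    print(f"{name}: {value:,.2f}")
```

A mean and median that sit close together, combined with a positive skew value, would point to a mild right tail rather than a heavily distorted distribution.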






##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Knowing the typical price range and its spread supports pricing and segmentation decisions. If a comparison across periods showed a significant decrease in sale prices or a widening spread of prices, businesses would need to be cautious. This could signify economic downturns, changes in consumer preferences, or other factors that may negatively impact the real estate market. In such cases, businesses may need to adjust their strategies, such as diversifying their portfolio, implementing cost-cutting measures, or exploring alternative markets to mitigate potential losses.






#### Chart - 12

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Sample data
data = {
    "Id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000, 307000, 200000, 129900, 118000]
}

# Convert data to DataFrame
df = pd.DataFrame(data)

# Sort DataFrame by Id
df = df.sort_values(by='Id')

# Plotting
plt.fill_between(df['Id'], df['SalePrice'], color="skyblue", alpha=0.4)
plt.plot(df['Id'], df['SalePrice'], color="Slateblue", alpha=0.6, linewidth=2)

# Customize plot
plt.title('Area Chart of Sale Prices')
plt.xlabel('Id')
plt.ylabel('Sale Price')
plt.grid(True)

# Show plot
plt.show()


##### 1. Why did you pick the specific chart?

I picked the area chart because it shows how a numerical variable ('SalePrice') varies across the records of the dataset, here ordered by 'Id'. Filling the area beneath the line emphasizes the magnitude of each value and makes high and low prices easy to compare. Because 'Id' is only a record identifier and not a time axis, the chart shows variation from house to house rather than a trend over time.






##### 2. What is/are the insight(s) found from the chart?

The area chart shows how sale prices vary across the data points (ordered by 'Id'). For the ten sample rows, prices range from 118,000 (Id 10) to 307,000 (Id 7) with no steady upward or downward pattern, which is expected because 'Id' does not encode time or location. In general, this view helps spot runs of higher or lower sale prices and individual extreme values.






##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Regarding negative growth, if the analysis reveals significant clusters of lower sale prices or declining trends over time, it could indicate challenges such as market saturation, decreased demand, or economic downturns. In such cases, businesses may need to reassess their strategies, innovate products or services, or explore new markets to mitigate the negative impact on growth.






#### Chart - 13

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Sample data provided
data = {
    "Id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000, 307000, 200000, 129900, 118000]
}

# Create DataFrame
df = pd.DataFrame(data)

# Create violin plot
plt.figure(figsize=(10, 6))
sns.violinplot(x="SalePrice", data=df)
plt.title("Distribution of Sale Prices")
plt.xlabel("Sale Price")
plt.ylabel("Density")
plt.show()


##### 1. Why did you pick the specific chart?

I chose a violin plot because it effectively displays the distribution of a numerical variable (in this case, sale prices) and provides insights into its spread, central tendency, and presence of any outliers. This type of chart is particularly useful when you want to compare the distribution of a variable across different categories or groups, or when you want to visualize the overall shape of the distribution. In this case, it allows us to see the distribution of sale prices across the given dataset.






##### 2. What is/are the insight(s) found from the chart?

Distribution of Sale Prices: The plot shows the distribution of sale prices for the properties in the dataset. We can observe the shape of the distribution, including any skewness and the presence of multiple peaks or modes.

Central Tendency: The width of the violin at each point along the sale-price axis (the x-axis in this plot) indicates the density of sale prices at that value. The widest part marks the mode, the most common price range.

Outliers: Outliers, if present, are visible as extended tails beyond the main body of the violin plot. These outliers can provide insights into exceptionally high or low sale prices in the dataset.
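
Reading tails off a violin plot is subjective, so a numeric rule can back it up. The sketch below applies the common 1.5 × IQR rule to the ten sample sale prices; the rule and the variable names are assumptions added for illustration and are not used elsewhere in this notebook.

```python
import pandas as pd

# Ten sample sale prices from the chart above (not the full dataset)
sale_price = pd.Series(
    [208500, 181500, 223500, 140000, 250000,
     143000, 307000, 200000, 129900, 118000],
    name="SalePrice",
)

# Quartiles and interquartile range
q1, q3 = sale_price.quantile([0.25, 0.75])
iqr = q3 - q1

# Values outside these fences are flagged as potential outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = sale_price[(sale_price < lower_fence) | (sale_price > upper_fence)]

print(f"Fences: [{lower_fence:,.0f}, {upper_fence:,.0f}]")
print(f"Flagged values: {outliers.tolist()}")
```

On these ten rows no price falls outside the fences; the full dataset may well contain flagged values.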

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Identifying Market Trends: Understanding the distribution of sale prices can help businesses identify trends in the real estate market. For example, if there is a shift towards higher sale prices in certain neighborhoods or property types, businesses can adjust their strategies accordingly to capitalize on these trends.

Targeted Marketing: By analyzing the distribution of sale prices across different categories such as property types or locations, businesses can tailor their marketing efforts to target specific segments of the market more effectively. For instance, if there is a high concentration of luxury properties in a particular area, businesses can focus their marketing efforts towards affluent buyers.

Optimizing Pricing Strategies: Insights from the violin plot can help businesses optimize their pricing strategies by understanding the range and distribution of sale prices. This information can guide pricing decisions for properties, ensuring they are competitive in the market while maximizing profitability.

#### Chart - 14 - Correlation Heatmap

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
data = {
    "Id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "MSSubClass": [60, 20, 60, 70, 60, 50, 20, 60, 50, 190],
    "MSZoning": ["RL", "RL", "RL", "RL", "RL", "RL", "RL", "RL", "RM", "RL"],
    "LotFrontage": [65, 80, 68, 60, 84, 85, 75, None, 51, 50],
    # Add more columns here...
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000, 307000, 200000, 129900, 118000]
}

df = pd.DataFrame(data)

# Compute the correlation matrix (numeric columns only; text columns such as MSZoning are skipped)
corr = df.corr(numeric_only=True)

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

plt.show()





##### 1. Why did you pick the specific chart?

I chose to create a correlation heatmap because it's an effective way to visualize the correlation between different variables in a dataset. This type of chart allows us to quickly identify which variables are positively or negatively correlated with each other, helping us understand the relationships within the data. Additionally, it provides insights into which variables might have a stronger influence on the target variable, which can be valuable for further analysis and decision-making.






##### 2. What is/are the insight(s) found from the chart?

Strong positive correlations: Variables that have a correlation coefficient close to 1 indicate a strong positive linear relationship. When one variable increases, the other tends to increase as well.

Strong negative correlations: Variables with a correlation coefficient close to -1 show a strong negative linear relationship. When one variable increases, the other tends to decrease.

Weak correlations: Variables with correlation coefficients close to 0 suggest a weak or no linear relationship between them.

Multicollinearity: High correlations between predictor variables (independent variables) might indicate multicollinearity, which can affect the performance of certain statistical models.
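
To make the multicollinearity check concrete, highly correlated feature pairs can be listed programmatically. This is a sketch under stated assumptions: the helper `high_corr_pairs`, the 0.8 threshold, and the toy columns `a`, `b`, `c` are all hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.8):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (self-correlation) is excluded
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Hypothetical toy features: 'b' is an exact multiple of 'a', 'c' is only weakly related
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 1, 4, 2, 3],
})
print(high_corr_pairs(toy))
```

In practice the target column would be dropped before calling the helper, so that only predictor-to-predictor correlations are reported.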

#### Chart - 15 - Pair Plot

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the data
data = {
    "Id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "MSSubClass": [60, 20, 60, 70, 60, 50, 20, 60, 50, 190],
    "MSZoning": ["RL", "RL", "RL", "RL", "RL", "RL", "RL", "RL", "RM", "RL"],
    # Add more columns similarly
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000, 307000, 200000, 129900, 118000]
}

df = pd.DataFrame(data)

# Drop non-numeric columns for pair plot
df_numeric = df.select_dtypes(include='number')

# Create pair plot
sns.pairplot(df_numeric)
plt.show()


##### 1. Why did you pick the specific chart?

I chose a pair plot because it is a useful visualization tool for exploring relationships between multiple variables in a dataset. With the pair plot, we can quickly identify patterns, correlations, and potential outliers across different pairs of variables. This is especially helpful for understanding the overall structure and distribution of the data, as well as identifying areas for further analysis. Additionally, pair plots are easy to interpret and provide a comprehensive overview of the data at a glance.






##### 2. What is/are the insight(s) found from the chart?

Correlation between numerical features: By examining the scatterplots off the diagonal, we can observe the relationships between pairs of numerical variables. For example, with the full dataset we can see whether there is a linear or non-linear relationship between variables like LotArea, GrLivArea, YearBuilt, etc.

Distribution of individual variables: The histograms along the diagonal show the distribution of each numerical variable. This helps in understanding the range and spread of values for each feature.

Potential outliers: Outliers can be identified by examining scatterplots for each pair of variables. Outliers appear as points that deviate significantly from the overall pattern of the data.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

1.   There is a significant correlation between the overall quality of a house (OverallQual) and its sale price (SalePrice).
2.   Houses with a larger lot area (LotArea) tend to have higher sale prices (SalePrice).
3.   The year a house was built (YearBuilt) has a significant impact on its sale price (SalePrice).

We will perform hypothesis testing to determine whether these statements hold true based on the provided dataset. Let's proceed with the hypothesis testing for each statement.






### Hypothetical Statement - 1




#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): There is no significant correlation between the overall quality of a house (OverallQual) and its sale price (SalePrice).

Alternate Hypothesis (H1): There is a significant correlation between the overall quality of a house (OverallQual) and its sale price (SalePrice).






#### 2. Perform an appropriate statistical test.

In [None]:
import pandas as pd
from scipy.stats import pearsonr

# Load the dataset
data = {
    "OverallQual": [7, 6, 7, 7, 8, 5, 8, 7, 7, 5],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000, 307000, 200000, 129900, 118000]
}

df = pd.DataFrame(data)

# Perform Pearson correlation coefficient test
corr, p_value = pearsonr(df['OverallQual'], df['SalePrice'])

print("Pearson correlation coefficient:", corr)
print("P-value:", p_value)


##### Which statistical test have you done to obtain P-Value?

I have performed a Pearson correlation coefficient test to obtain the p-value.






##### Why did you choose the specific statistical test?

I chose the Pearson correlation coefficient test because it is commonly used to measure the strength and direction of the linear relationship between two numerical variables, and it returns a p-value for the null hypothesis of no correlation. OverallQual is an ordinal rating on a 1 to 10 scale, so treating it as continuous is an approximation.
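
The test output still has to be turned into a decision about H0. The sketch below shows the usual rule; the helper name `interpret_correlation_test` and the significance level of 0.05 are assumptions, since the notebook does not state a threshold.

```python
def interpret_correlation_test(p_value, alpha=0.05):
    """Return the decision for H0 (no correlation) at significance level alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie between 0 and 1")
    if p_value < alpha:
        return "reject H0"          # evidence of a correlation
    return "fail to reject H0"      # no evidence of a correlation at this level

# Hypothetical p-values, for illustration only
print(interpret_correlation_test(0.01))
print(interpret_correlation_test(0.20))
```

Failing to reject H0 does not prove that the variables are unrelated; it only means the sample does not provide enough evidence at the chosen level.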






### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): There is no significant correlation between the lot area of a house (LotArea) and its sale price (SalePrice).

Alternate Hypothesis (H1): There is a significant correlation between the lot area of a house (LotArea) and its sale price (SalePrice).

#### 2. Perform an appropriate statistical test.

In [None]:
import pandas as pd
from scipy.stats import pearsonr

# Load the dataset
# Assuming the full dataset is already loaded into a DataFrame named 'data'
# and contains the 'LotArea' and 'SalePrice' columns

# Extract the 'LotArea' and 'SalePrice' columns
lot_area = data['LotArea']
sale_price = data['SalePrice']

# Perform Pearson correlation test
correlation_coefficient, p_value = pearsonr(lot_area, sale_price)

print("Pearson Correlation Coefficient:", correlation_coefficient)
print("P-value:", p_value)





##### Which statistical test have you done to obtain P-Value?

The statistical test used to obtain the p-value is the Pearson correlation test.






##### Why did you choose the specific statistical test?

I chose the Pearson correlation test because it is suitable for examining the linear relationship between two continuous variables. Correlation analysis is used to determine whether there is a significant association between variables and to quantify the strength and direction of that association. Since the goal here is to explore the relationship between two numerical variables in the dataset, the Pearson correlation test is appropriate.






### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): There is no significant correlation between the year a house was built (YearBuilt) and its sale price (SalePrice).

Alternate Hypothesis (H1): There is a significant correlation between the year a house was built (YearBuilt) and its sale price (SalePrice).

#### 2. Perform an appropriate statistical test.

In [None]:
import pandas as pd
from scipy.stats import pearsonr

# Load the dataset
# Assuming 'df' is the DataFrame containing the full dataset with the variables of interest

# Select the two variables for which to test the correlation
variable1_name = 'YearBuilt'
variable2_name = 'SalePrice'

# Check if the variables exist in the DataFrame
if variable1_name in df.columns and variable2_name in df.columns:
    # Extract the variables
    variable1 = df[variable1_name]
    variable2 = df[variable2_name]

    # Perform Pearson correlation coefficient test
    corr_coeff, p_value = pearsonr(variable1, variable2)

    print("Pearson Correlation Coefficient:", corr_coeff)
    print("P-Value:", p_value)
else:
    print("One or both of the variables not found in the DataFrame.")


##### Which statistical test have you done to obtain P-Value?

The statistical test used to obtain the p-value is the Pearson correlation coefficient test.







##### Why did you choose the specific statistical test?

I chose the Pearson correlation coefficient test because it is commonly used to measure the linear relationship between two continuous variables. This test helps determine if there is a significant correlation between the variables of interest, which is relevant for exploring relationships in datasets and hypothesis testing.






## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
import pandas as pd

# Load the dataset into a DataFrame
data = {
    'Id': [1, 2, 3, 4, 5],
    'MSSubClass': [60, 20, 60, 70, 60],
    'MSZoning': ['RL', 'RL', 'RL', 'RL', 'RL'],
    'LotFrontage': [65, 80, 68, 60, 84],
    # Other columns...
    'SalePrice': [208500, 181500, 223500, 140000, 250000]  # Example SalePrice column
}

df = pd.DataFrame(data)

# Identify missing values
missing_values = df.isnull().sum()
print("Missing Values:\n", missing_values)

# Handle missing values (impute with mean)
df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].mean())

# Verify missing values are handled
missing_values_after_imputation = df.isnull().sum()
print("\nMissing Values After Imputation:\n", missing_values_after_imputation)


#### What all missing value imputation techniques have you used and why did you use those techniques?

I used mean imputation: each missing value is replaced with the mean of the non-missing values in the same column. This is a common, straightforward approach for numerical features such as LotFrontage.

The reasons for using this technique are:

Preservation of the Column Mean: Mean imputation leaves the column mean unchanged, and its effect on the overall distribution is small when only a small share of values is missing. It does reduce the variance, because every imputed value sits exactly at the center.

Ease of Implementation: Mean imputation is simple to implement and requires minimal computational resources compared to more complex techniques.

Caveat: Because every missing value receives the same number, mean imputation can weaken correlations with other features and is sensitive to outliers. For skewed columns, the median is a common alternative.
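
The effect of imputation on a column can be checked directly. The following sketch uses the ten sample LotFrontage values from the earlier chart cells, which include one missing entry; the median variant is shown only for comparison and is not part of the notebook's pipeline.

```python
import pandas as pd

# Sample LotFrontage values from this notebook; one entry is missing
lot_frontage = pd.Series(
    [65, 80, 68, 60, 84, 85, 75, None, 51, 50], name="LotFrontage"
)

# Fill the gap with the mean and, for comparison, with the median
mean_filled = lot_frontage.fillna(lot_frontage.mean())
median_filled = lot_frontage.fillna(lot_frontage.median())

print("Missing before:", lot_frontage.isna().sum())
print("Missing after: ", mean_filled.isna().sum())
print("Std before:    ", round(lot_frontage.std(), 2))
print("Std after mean imputation:", round(mean_filled.std(), 2))
```

The standard deviation drops slightly after mean imputation because the filled value adds no spread, which is worth keeping in mind when many values are missing.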

### 2. Handling Outliers

In [None]:
import pandas as pd
from scipy.stats.mstats import winsorize

# Read the dataset
data = pd.read_csv("/content/cleaned_data.csv")

# Apply winsorization to handle outliers in the 'SalePrice' column
# For demonstration, winsorizing at 5% and 95% quantiles
data['SalePrice'] = winsorize(data['SalePrice'], limits=[0.05, 0.05])

# Display the modified dataset
print(data.head())


##### What all outlier treatment techniques have you used and why did you use those techniques?

I used winsorization as the outlier treatment technique. Winsorization replaces extreme values (outliers) with less extreme ones, typically by capping them at a specified percentile of the data distribution.

The reasons for using winsorization include:

Preservation of Data: Winsorization retains all data points in the dataset while reducing the impact of extreme values. This can be important when you want to preserve the original dataset structure.

Robustness: Capping extreme values makes statistics such as the mean and standard deviation less sensitive to a few very large or very small sale prices, which helps with skewed distributions or datasets with a high degree of variability.

Maintaining Statistical Properties: Unlike trimming (deleting outlier rows), winsorization keeps the sample size and the rank order of the observations and leaves the median unchanged. The mean and standard deviation shift only as far as the capped values move.

Flexibility: Winsorization allows for customization by specifying the percentiles at which to cap the data. This flexibility enables tailored treatment based on the characteristics of the dataset and the specific requirements of the analysis.
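
The capping idea can also be sketched with pandas alone by clipping at the 5th and 95th percentiles. This is an illustration on the ten sample sale prices, not the notebook's method: it uses interpolated quantiles, so it need not produce exactly the same values as `scipy.stats.mstats.winsorize`, which works on counts of observations.

```python
import pandas as pd

# Ten sample sale prices from this notebook (not the full dataset)
sale_price = pd.Series(
    [208500, 181500, 223500, 140000, 250000,
     143000, 307000, 200000, 129900, 118000],
    name="SalePrice",
)

# Percentile caps (interpolated between neighbouring observations)
lower_cap = sale_price.quantile(0.05)
upper_cap = sale_price.quantile(0.95)

# Values below/above the caps are pulled to the caps; all rows are kept
capped = sale_price.clip(lower=lower_cap, upper=upper_cap)

print(f"Caps: [{lower_cap:,.0f}, {upper_cap:,.0f}]")
print("Rows before/after:", len(sale_price), len(capped))
print("Median before/after:", sale_price.median(), capped.median())
```

Only the lowest and highest sample prices change here, while the row count and the median stay the same.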

### 3. Categorical Encoding

In [None]:
import pandas as pd

# Load the dataset
data = {
    'Id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'MSZoning': ['RL', 'RL', 'RL', 'RL', 'RL', 'RL', 'RL', 'RL', 'RM', 'RL'],
    # Add more columns as needed
}

df = pd.DataFrame(data)

# Display the original DataFrame
print("Original DataFrame:")
print(df)

# Encode categorical columns using label encoding
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
df['MSZoning_encoded'] = label_encoder.fit_transform(df['MSZoning'])

# Display the DataFrame with encoded categorical columns
print("\nDataFrame with encoded categorical columns:")
print(df)


#### What all categorical encoding techniques have you used & why did you use those techniques?

I used label encoding (LabelEncoder) on the 'MSZoning' column, which maps each category to an integer. In general, the choice of categorical encoding technique depends on the nature of the categorical data and the requirements of the machine learning algorithm being used. For example:

If the categorical feature is ordinal (i.e., it has a meaningful order), label encoding or ordinal encoding can be used to represent the order.

If the categorical feature is nominal (i.e., there is no inherent order), one-hot encoding is typically preferred to avoid introducing false ordinal relationships. 'MSZoning' is a nominal feature, so the integer codes produced by label encoding should not be read as an order.

If the categorical feature has a large number of unique categories, one-hot encoding might result in a high-dimensional feature space, which can increase computational complexity and the risk of overfitting. In such cases, other encoding techniques like target encoding or feature hashing might be considered.

Overall, the choice of categorical encoding technique should be guided by the specific characteristics of the data and the goals of the machine learning task.
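
For a nominal feature such as 'MSZoning', one-hot encoding can be sketched with `pd.get_dummies`. The example uses the same ten sample rows as the cell above; the names `df_zoning` and `encoded` are illustrative, and this is one possible approach rather than the encoding applied in the notebook.

```python
import pandas as pd

# Same sample rows as in the label-encoding cell
df_zoning = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "MSZoning": ["RL", "RL", "RL", "RL", "RL", "RL", "RL", "RL", "RM", "RL"],
})

# One indicator column per category; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(df_zoning, columns=["MSZoning"], dtype=int)

print(encoded)
```

Each category becomes its own 0/1 column, so no artificial ordering between 'RL' and 'RM' is introduced.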

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Dictionary mapping contracted words to expanded forms
contraction_mapping = {
    "Isn't": "is not",
    "aren't": "are not",
    "can't": "cannot",
    "can't've": "cannot have",
    "'cause": "because",
    "could've": "could have",
    "couldn't": "could not",
    "couldn't've": "could not have",
    "didn't": "did not",
    "doesn't": "does not",
    "don't": "do not",
    "hadn't": "had not",
    "hadn't've": "had not have",
    "hasn't": "has not",
    "haven't": "have not",
    "he'd": "he would",
    "he'd've": "he would have",
    "he'll": "he will",
    "he'll've": "he will have",
    "he's": "he is",
    "how'd": "how did",
    "how'd'y": "how do you",
    "how'll": "how will",
    "how's": "how is",
    "I'd": "I would",
    "I'd've": "I would have",
    "I'll": "I will",
    "I'll've": "I will have",
    "I'm": "I am",
    "I've": "I have",
    "i'd": "i would",
    "i'd've": "i would have",
    "i'll": "i will",
    "i'll've": "i will have",
    "i'm": "i am",
    "i've": "i have",
    "isn't": "is not",
    "it'd": "it would",
    "it'd've": "it would have",
    "it'll": "it will",
    "it'll've": "it will have",
    "it's": "it is",
    "let's": "let us",
    "ma'am": "madam",
    "mayn't": "may not",
    "might've": "might have",
    "mightn't": "might not",
    "mightn't've": "might not have",
    "must've": "must have",
    "mustn't": "must not",
    "mustn't've": "must not have",
    "needn't": "need not",
    "needn't've": "need not have",
    "o'clock": "of the clock",
    "oughtn't": "ought not",
    "oughtn't've": "ought not have",
    "shan't": "shall not",
    "sha'n't": "shall not",
    "shan't've": "shall not have",
    "she'd": "she would",
    "she'd've": "she would have",
    "she'll": "she will",
    "she'll've": "she will have",
    "she's": "she is",
    "should've": "should have",
    "shouldn't": "should not",
    "shouldn't've": "should not have",
    "so've": "so have",
    "so's": "so is",
    "that'd": "that would",
    "that'd've": "that would have",
    "that's": "that is",
    "there'd": "there would",
    "there'd've": "there would have",
    "there's": "there is",
    "they'd": "they would",
    "they'd've": "they would have",
    "they'll": "they will",
    "they'll've": "they will have",
    "they're": "they are",
    "they've": "they have",
    "to've": "to have",
    "wasn't": "was not",
    "we'd": "we would",
    "we'd've": "we would have",
    "we'll": "we will",
    "we'll've": "we will have",
    "we're": "we are",
    "we've": "we have",
    "weren't": "were not",
    "what'll": "what will",
    "what'll've": "what will have",
    "what're": "what are",
    "what's": "what is",
    "what've": "what have",
    "when's": "when is",
    "when've": "when have",
    "where'd": "where did",
    "where's": "where is",
    "where've": "where have",
    "who'll": "who will",
    "who'll've": "who will have",
    "who's": "who is",
    "who've": "who have",
    "why's": "why is",
    "why've": "why have",
    "will've": "will have",
    "won't": "will not",
    "won't've": "will not have",
    "would've": "would have",
    "wouldn't": "would not",
    "wouldn't've": "would not have",
    "y'all": "you all",
    "y'all'd": "you all would",
    "y'all'd've": "you all would have",
    "y'all're": "you all are",
    "y'all've": "you all have",
    "you'd": "you would",
    "you'd've": "you would have",
    "you'll": "you will",
    "you'll've": "you will have",
    "you're": "you are",
    "you've": "you have"
}

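# Function to expand contractions in a given text
def expand_contractions(text, contraction_mapping):
    # Replace longer contractions first so that "can't've" is not
    # partially rewritten by the shorter "can't" entry
    for contraction in sorted(contraction_mapping, key=len, reverse=True):
        text = text.replace(contraction, contraction_mapping[contraction])
    return text

# Example usage on the text (object) columns of df; numeric columns and
# missing values are left untouched instead of being converted to strings
for column in df.select_dtypes(include='object').columns:
    df[column] = df[column].apply(
        lambda x: expand_contractions(x, contraction_mapping) if isinstance(x, str) else x
    )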


#### 2. Lower Casing

In [None]:
# Convert all text data to lowercase
for column in df.columns:
    if df[column].dtype == 'object':  # Check if the column contains text data
        df[column] = df[column].str.lower()


#### 3. Removing Punctuations

In [None]:
import re

# Remove punctuations from text data
for column in df.columns:
    if df[column].dtype == 'object':  # Check if the column contains text data
        # The vectorized .str.replace leaves missing values (NaN) as they are
        df[column] = df[column].str.replace(r'[^\w\s]', '', regex=True)


#### 4. Removing URLs & Removing words and digits containing digits

In [None]:
import re

# Function to remove URLs from text
def remove_urls(text):
    return re.sub(r'http\S+', '', text)

# Function to remove words and digits containing digits
def remove_digits(text):
    return re.sub(r'\w*\d\w*', '', text)

# Remove URLs from text data
for column in df.columns:
    if df[column].dtype == 'object':  # Check if the column contains text data
        df[column] = df[column].apply(remove_urls)

# Remove words and digits containing digits from text data
for column in df.columns:
    if df[column].dtype == 'object':  # Check if the column contains text data
        df[column] = df[column].apply(remove_digits)


#### 5. Removing Stopwords & Removing White spaces

In [None]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Ensure you have the stopwords dataset downloaded
nltk.download('stopwords')
nltk.download('punkt')

# Sample DataFrame creation (replace this with your actual DataFrame)
data = {
    'ID': [1, 2],
    'Description': ['This is a sample description with some common words.',
                    'Another example, with a set of different common words.']
}
df = pd.DataFrame(data)

# Function to remove stopwords
def remove_stopwords(text):
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(text)
    filtered_text = [word for word in word_tokens if word.lower() not in stop_words]
    return ' '.join(filtered_text)

# Applying the function to the Description column
df['Description'] = df['Description'].apply(remove_stopwords)

print(df)


In [None]:
import pandas as pd

# Load the dataset (replace the path with the location of your dataset file)
df = pd.read_csv('/content/cleaned_data.csv')

# Remove white spaces in column names
df.columns = df.columns.str.replace(' ', '')

# Strip leading and trailing white spaces in text (object type) columns;
# spaces between words are kept so multi-word values stay readable
for col in df.select_dtypes(include=['object']).columns:
    df[col] = df[col].str.strip()

# Save the cleaned dataset
df.to_csv('cleaned_dataset.csv', index=False)


#### 6. Rephrase Text

In [None]:
import pandas as pd

# Sample data creation
data = {
    "ID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "MSSubClass": [60, 20, 60, 70, 60, 50, 20, 60, 50, 190],
    "MSZoning": ["RL", "RL", "RL", "RL", "RL", "RL", "RL", "RL", "RM", "RL"],
    "LotFrontage": [65, 80, 68, 60, 84, 85, 75, None, 51, 50],
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382, 6120, 7420],
    "Street": ["Pave", "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", "Pave"],
    # This is just a portion of the columns. For brevity, not all columns are included.
}

# Converting dictionary to DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
print(df)


#### 7. Tokenization

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Assuming you've loaded your dataset into a pandas DataFrame named `df`

# Separate the features and target variable if 'SalePrice' is present
if 'SalePrice' in df.columns:
    X = df.drop('SalePrice', axis=1)
    y = df['SalePrice']
else:
    X = df
    y = None

# Identify numerical and categorical columns
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = X.select_dtypes(include=['object', 'bool']).columns

# Preprocessing for numerical data: impute missing values and scale data
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Preprocessing for categorical data: impute missing values and apply one-hot encoding
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Split data into training and validation sets
if y is not None:
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)
else:
    X_train, X_valid = X, None

# Preprocess the data
X_train_preprocessed = preprocessor.fit_transform(X_train)
if X_valid is not None:
    X_valid_preprocessed = preprocessor.transform(X_valid)

# Now X_train_preprocessed and X_valid_preprocessed are ready for a machine learning model.


#### 8. Text Normalization

In [None]:
import pandas as pd
import nltk
from nltk.stem import WordNetLemmatizer

# Download WordNet corpus if not already downloaded
nltk.download('wordnet')

# Load the dataset
df = pd.read_csv("/content/cleaned_data.csv")

# Assuming the columns containing textual data are 'MSZoning', 'Street', 'Alley', etc.
text_columns = ['MSZoning', 'Street', 'Alley', 'Neighborhood', 'Condition1', 'Condition2',
                'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd',
                'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
                'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir',
                'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish',
                'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature', 'SaleType',
                'SaleCondition']

# Initialize WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()

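# Lemmatize text in each column, leaving missing values untouched
# (str(NaN) would otherwise turn them into the literal string "nan")
for col in text_columns:
    # Check if the column exists in the dataset
    if col in df.columns:
        df[col] = df[col].apply(
            lambda x: " ".join(lemmatizer.lemmatize(word) for word in str(x).split())
            if pd.notna(x) else x
        )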

# Now, the text in the specified columns is lemmatized


##### Which text normalization technique have you used and why?

I've used lemmatization as the text normalization technique. Lemmatization reduces words to their base or root form, which can be useful for tasks like text analysis, natural language processing, and machine learning.

Here's why I chose lemmatization:

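Semantic Equivalence: Lemmatization reduces words to their base form, which helps in achieving semantic equivalence. For example, words like "running", "runs", and "ran" all get reduced to the base form "run". With NLTK's WordNetLemmatizer this requires passing the verb part of speech (pos='v'), because the default treats every word as a noun.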

Improved Analysis: By converting words to their base forms, lemmatization helps in improving the accuracy of text analysis tasks such as sentiment analysis, topic modeling, and information retrieval.

Reduced Vocabulary Size: Lemmatization helps in reducing the vocabulary size by collapsing different inflected forms of a word into a single lemma. This can be beneficial in reducing the dimensionality of text data, especially in machine learning models.

#### 9. Part of speech tagging

In [None]:
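import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize

# Download the tokenizer and tagger resources if not already present
# (newer NLTK releases may instead ask for 'punkt_tab' and
# 'averaged_perceptron_tagger_eng')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')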

# Sample text data
text = """
ID MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
1 60 RL 65 8450 Pave NA Reg Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2003 2003 Gable CompShg VinylSd VinylSd BrkFace 196 Gd TA PConc Gd TA No GLQ 706 Unf 0 150 856 GasA Ex Y SBrkr 856 854 0 1710 1 0 2 1 3 1 Gd 8 Typ 0 NA Attchd 2003 RFn 2 548 TA TA Y 0 61 0 0 0 0 NA NA NA 0 2 2008 WD Normal 208500
2 20 RL 80 9600 Pave NA Reg Lvl AllPub FR2 Gtl Veenker Feedr Norm 1Fam 1Story 6 8 1976 1976 Gable CompShg MetalSd MetalSd None 0 TA TA CBlock Gd TA Gd ALQ 978 Unf 0 284 1262 GasA Ex Y SBrkr 1262 0 0 1262 0 1 2 0 3 1 TA 6 Typ 1 TA Attchd 1976 RFn 2 460 TA TA Y 298 0 0 0 0 0 NA NA NA 0 5 2007 WD Normal 181500
3 60 RL 68 11250 Pave NA IR1 Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2001 2002 Gable CompShg VinylSd VinylSd BrkFace 162 Gd TA PConc Gd TA Mn GLQ 486 Unf 0 434 920 GasA Ex Y SBrkr 920 866 0 1786 1 0 2 1 3 1 Gd 6 Typ 1 TA Attchd 2001 RFn 2 608 TA TA Y 0 42 0 0 0 0 NA NA NA 0 9 2008 WD Normal 223500
4 70 RL 60 9550 Pave NA IR1 Lvl AllPub Corner Gtl Crawfor Norm Norm 1Fam 2Story 7 5 1915 1970 Gable CompShg Wd Sdng Wd Shng None 0 TA TA BrkTil TA Gd No ALQ 216 Unf 0 540 756 GasA Gd Y SBrkr 961 756 0 1717 1 0 1 0 3 1 Gd 7 Typ 1 Gd Detchd 1998 Unf 3 642 TA TA Y 0 35 272 0 0 0 NA NA NA 0 2 2006 WD Abnorml 140000
5 60 RL 84 14260 Pave NA IR1 Lvl AllPub FR2 Gtl NoRidge Norm Norm 1Fam 2Story 8 5 2000 2000 Gable CompShg VinylSd VinylSd BrkFace 350 Gd TA PConc Gd TA Av GLQ 655 Unf 0 490 1145 GasA Ex Y SBrkr 1145 1053 0 2198 1 0 2 1 4 1 Gd 9 Typ 1 TA Attchd 2000 RFn 3 836 TA TA Y 192 84 0 0 0 0 NA NA NA 0 12 2008 WD Normal 250000
"""

# Tokenize the text into words
words = word_tokenize(text)

# Perform POS tagging
pos_tags = pos_tag(words)

# Print the POS tags
print(pos_tags)

#### 10. Text Vectorization

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Sample text data
data = {
    'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Text': [
        "RL Pave NA Reg Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story Gable CompShg VinylSd VinylSd BrkFace Gd TA PConc Gd TA No GLQ Unf GasA Ex Y SBrkr GasA TA Y 0 61 0 0 0 0 NA NA NA WD Normal",
        "RL Pave NA Reg Lvl AllPub FR2 Gtl Veenker Feedr Norm 1Fam 1Story Gable CompShg MetalSd MetalSd None TA TA CBlock Gd TA Gd ALQ Unf GasA Ex Y SBrkr GasA TA Y 298 0 0 0 0 0 NA NA NA WD Normal",
        "RL Pave NA IR1 Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story Gable CompShg VinylSd VinylSd BrkFace Gd TA PConc Gd TA Mn GLQ Unf GasA Ex Y SBrkr GasA TA Y 0 42 0 0 0 0 NA NA NA WD Normal",
        "RL Pave NA IR1 Lvl AllPub Corner Gtl Crawfor Norm Norm 1Fam 2Story Gable CompShg Wd Sdng Wd Shng None TA TA BrkTil TA Gd No ALQ Unf GasA Ex Y SBrkr GasA TA Y 0 35 272 0 0 0 NA NA NA WD Abnorml",
        "RL Pave NA IR1 Lvl AllPub FR2 Gtl NoRidge Norm Norm 1Fam 2Story Gable CompShg VinylSd VinylSd BrkFace Gd TA PConc Gd TA Av GLQ Unf GasA Ex Y SBrkr GasA TA Y 192 84 0 0 0 0 NA NA NA WD Normal",
        "RL Pave NA IR1 Lvl AllPub Inside Gtl Mitchel Norm Norm 1Fam 1.5Fin Gable CompShg VinylSd VinylSd None TA TA Wood Gd TA No GLQ Unf GasA Ex Y SBrkr GasA TA Y 40 30 0 320 0 0 NA MnPrv Shed",
        "RL Pave NA Reg Lvl AllPub Inside Gtl Somerst Norm Norm 1Fam 1Story Gable CompShg VinylSd VinylSd Stone Gd TA PConc Ex TA Av GLQ Unf GasA Ex Y SBrkr GasA TA Y 255 57 0 0 0 0 NA NA NA WD Normal",
        "RL Pave NA IR1 Lvl AllPub Corner Gtl NWAmes PosN Norm 1Fam 2Story Gable CompShg HdBoard HdBoard Stone TA TA CBlock Gd TA Mn ALQ BLQ GasA Ex Y SBrkr GasA TA Y 235 204 228 0 0 0 NA NA Shed",
        "RM Pave NA Reg Lvl AllPub Inside Gtl OldTown Artery Norm 1Fam 1.5Fin Gable CompShg BrkFace Wd Shng None TA TA BrkTil TA TA No Unf Unf GasA Gd Y FuseF GasA TA Y 90 0 205 0 0 0 NA NA NA WD Abnorml",
        "RL Pave NA Reg Lvl AllPub Corner Gtl BrkSide Artery Artery 2fmCon 1.5Unf Gable CompShg MetalSd MetalSd None TA TA BrkTil TA TA No GLQ Unf GasA Ex Y SBrkr GasA TA Y 0 4 0 0 0 0 NA NA NA WD Normal"
    ]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Initialize CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the text data
X = vectorizer.fit_transform(df['Text'])

# Convert the result to DataFrame
df_vectorized = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

# Concatenate the original DataFrame with the vectorized DataFrame
df_concatenated = pd.concat([df, df_vectorized], axis=1)

# Display the result
print(df_concatenated)


##### Which text vectorization technique have you used and why?

I used CountVectorizer for text vectorization.

What CountVectorizer does:

- It converts a collection of text documents into a matrix of token counts, where each row represents a document and each column represents a unique word in the corpus.
- It counts the frequency of each word in the document, which can be a useful feature for various machine learning algorithms.
- It is simple and easy to understand, making it a good choice for basic text analysis tasks.

Why CountVectorizer:

- The goal here is to convert the text data into a format suitable for machine learning algorithms, where each word's frequency serves as a feature.
- CountVectorizer provides a straightforward way to achieve this by converting text data into a sparse matrix representation.
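
To make the token-count matrix concrete, the sketch below rebuilds it by hand with the standard library. It assumes CountVectorizer's default behaviour (lowercasing, and keeping only tokens of two or more word characters). The helper name `count_matrix` and the two short documents are illustrative and not part of the project code.

```python
import re
from collections import Counter

# Same idea as CountVectorizer's default token pattern: 2+ word characters
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def count_matrix(documents):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in documents[i]."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in documents]
    # The vocabulary is the sorted set of all tokens seen in any document
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


# Two hypothetical documents built from category codes in the dataset
docs = ["RL Pave Gable Y", "RM Pave Pave Gable"]
vocabulary, rows = count_matrix(docs)
print(vocabulary)
for row in rows:
    print(row)
```

Under these defaults a single-character code such as `Y` does not become a column, which is worth keeping in mind when the "text" is really a string of category codes.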

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
import pandas as pd

# Load the dataset
data = pd.read_csv("/content/cleaned_data.csv")

# Drop columns with high correlation
high_corr_cols = ['GarageArea', 'GarageYrBlt', 'TotRmsAbvGrd']
data.drop(high_corr_cols, axis=1, inplace=True)

# Create new feature 'TotalSF' by adding 'TotalBsmtSF', '1stFlrSF', and '2ndFlrSF'
data['TotalSF'] = data['TotalBsmtSF'] + data['1stFlrSF'] + data['2ndFlrSF']

# Create new feature 'TotalBath' as 'FullBath' plus half of 'HalfBath' (a half bath counts as 0.5)
data['TotalBath'] = data['FullBath'] + 0.5 * data['HalfBath']

# Create new feature 'TotalPorchSF' by adding 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', and 'ScreenPorch'
data['TotalPorchSF'] = data['OpenPorchSF'] + data['EnclosedPorch'] + data['3SsnPorch'] + data['ScreenPorch']

# Drop original columns used to create new features
data.drop(['TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'FullBath', 'HalfBath', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch'], axis=1, inplace=True)

# Check the modified dataset
print(data.head())


#### 2. Feature Selection

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

# Load the dataset
data = pd.read_csv("/content/cleaned_data.csv")

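# Drop the identifier column, which carries no predictive information
data.drop("Id", axis=1, inplace=True)

# Fill missing numeric values with the column mean
# (numeric_only=True avoids a TypeError on text columns in recent pandas)
data.fillna(data.mean(numeric_only=True), inplace=True)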

# Encode categorical variables (e.g., one-hot encoding)
data = pd.get_dummies(data)

# Split data into features and target variable
X = data.drop("SalePrice", axis=1)
y = data["SalePrice"]

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize and fit Lasso regression model
lasso = Lasso(alpha=0.1)  # Adjust alpha as needed
lasso.fit(X_train_scaled, y_train)

# Evaluate model
train_score = lasso.score(X_train_scaled, y_train)
test_score = lasso.score(X_test_scaled, y_test)

print("Train R^2 score:", train_score)
print("Test R^2 score:", test_score)


##### What all feature selection methods have you used  and why?

Correlation Analysis: This method measures the linear relationship between features and the target variable. Features with high correlation coefficients (either positive or negative) with the target are considered important. It helps identify features that have a strong influence on the target variable.

Feature Importance from Tree-based Models: Tree-based models like Random Forest and Gradient Boosting Machines provide a feature importance score. In scikit-learn this score is based on how much each feature reduces impurity across all the trees in the ensemble (some libraries instead count how often a feature is used to split). Features with higher importance scores are considered more important for prediction.

Recursive Feature Elimination (RFE): RFE recursively removes features and builds a model on the remaining features until the specified number of features is reached. It ranks features based on their importance and eliminates the least important ones. RFE helps in selecting the most relevant features for the model.
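
As an illustration of the first method, correlation analysis, the sketch below ranks three features by the absolute value of their Pearson correlation with SalePrice. The five rows are copied from the sample records shown in the part-of-speech tagging cell. The helper names `pearson` and `rank_by_correlation` are hypothetical, and with only five rows the resulting ranking is illustrative rather than a finding.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_by_correlation(features, target):
    """Return (name, r) pairs sorted by |r| with the target, strongest first."""
    scores = [(name, pearson(values, target)) for name, values in features.items()]
    return sorted(scores, key=lambda pair: abs(pair[1]), reverse=True)


# First five houses from the sample data
features = {
    "OverallQual": [7, 6, 7, 7, 8],
    "GrLivArea": [1710, 1262, 1786, 1717, 2198],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
}
sale_price = [208500, 181500, 223500, 140000, 250000]

for name, r in rank_by_correlation(features, sale_price):
    print(f"{name}: r = {r:.3f}")
```

On the full dataset the same ranking can be obtained from the correlation matrix of the numeric columns; the manual version is only meant to show what the coefficient measures.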

##### Which all features you found important and why?

OverallQual: The overall quality rating of the house is often a strong predictor of its price. Higher-quality houses tend to have higher prices.

GrLivArea: The above-ground living area (in square feet) is a crucial factor in determining the price of a house. Larger living areas often correlate with higher prices.

TotalBsmtSF: The total basement area (in square feet) is another significant feature. A larger basement area can add value to a house.

YearBuilt: The year the house was built can influence its price. Newer houses may be priced higher due to modern amenities and construction standards.

Neighborhood: The neighborhood in which the house is located can have a significant impact on its price. Desirable neighborhoods with good schools, amenities, and low crime rates tend to command higher prices.

GarageCars and GarageArea: The size and capacity of the garage can affect the price of a house. More garage space or higher capacity for cars can increase the value.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Load the data
data = pd.read_csv("/content/cleaned_data.csv")

# Separate features and target variable
X = data.drop(columns=["SalePrice"])  # Features
y = data["SalePrice"]  # Target variable

# Define numerical and categorical features
numerical_features = X.select_dtypes(include=["int64", "float64"]).columns
categorical_features = X.select_dtypes(include=["object"]).columns

# Define preprocessing steps for numerical and categorical features
numerical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numerical_transformer, numerical_features),
        ("cat", categorical_transformer, categorical_features)
    ])

# Apply preprocessing to the data
X_transformed = preprocessor.fit_transform(X)

# Now X_transformed contains the transformed features


### 6. Data Scaling

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the data
data = pd.read_csv("/content/cleaned_data.csv")

# Separate features and target variable
X = data.drop(columns=["SalePrice"])  # Features
y = data["SalePrice"]  # Target variable

# Select numerical features
numerical_features = X.select_dtypes(include=["int64", "float64"]).columns

# Scale numerical features
scaler = StandardScaler()
X[numerical_features] = scaler.fit_transform(X[numerical_features])

# Now X contains the scaled numerical features


##### Which method have you used to scale your data and why?
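
The scaling cell above uses StandardScaler, which standardizes each numerical column to zero mean and unit variance via z = (x - mean) / std, where std is the population standard deviation. The sketch below applies that formula by hand to the first five LotArea values from the sample data. The helper name `standardize` is hypothetical.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean, std = fmean(values), pstdev(values)
    return [(x - mean) / std for x in values]


# First five LotArea values from the sample data
lot_area = [8450, 9600, 11250, 9550, 14260]
z_scores = standardize(lot_area)
print([round(z, 3) for z in z_scores])
```

After standardization, features measured on very different scales (for example LotArea and OverallQual) contribute comparably to scale-sensitive models such as Lasso or PCA.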

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?


Whether dimensionality reduction is needed depends on various factors such as the nature of the dataset, the machine learning algorithm being used, computational resources available, and the desired outcome.

Here are some reasons why dimensionality reduction might be needed:

Curse of Dimensionality: As the number of features increases, the volume of the feature space grows exponentially, which can lead to sparsity of data points. This can make it difficult for machine learning algorithms to effectively learn from the data, leading to overfitting or poor generalization.

Computational Efficiency: High-dimensional datasets require more computational resources (memory and processing power) for training machine learning models. Dimensionality reduction can help reduce the computational burden by simplifying the dataset while preserving most of the important information.

Visualization: It is challenging to visualize data in high-dimensional spaces. Dimensionality reduction techniques project the data onto lower-dimensional spaces, making it easier to visualize and interpret.

Noise Reduction: Dimensionality reduction techniques can help remove redundant or noisy features, leading to better performance and more interpretable models.

Model Performance: In some cases, reducing the dimensionality of the dataset can improve the performance of machine learning models by focusing on the most informative features and reducing the impact of irrelevant or redundant ones.

However, dimensionality reduction is not always necessary or beneficial. Here are some reasons why it might not be needed:

Informative Features: If the dataset contains a small number of highly informative features, reducing dimensionality may result in loss of important information and degrade model performance.

Interpretability: In some cases, maintaining the original features is important for interpretability, especially if the goal is to understand the relationships between input features and the target variable.

Algorithm Compatibility: Some machine learning algorithms, such as tree-based models like Random Forests or Gradient Boosting Machines, can handle high-dimensional data efficiently without the need for dimensionality reduction.

In [None]:
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the dataset
data = pd.read_csv("/content/cleaned_data.csv")

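# Select features and target variable
X = data.drop(columns=["Id", "SalePrice"])  # Features
y = data["SalePrice"]  # Target variable

# PCA needs purely numeric input without missing values, so one-hot encode
# any remaining categorical columns and fill gaps with the column median
X = pd.get_dummies(X, dtype=float)
X = X.fillna(X.median())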

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA for dimensionality reduction
pca = PCA(n_components=0.95)  # Retain 95% of variance
X_pca = pca.fit_transform(X_scaled)

# Optional: Print the explained variance ratio
print("Explained Variance Ratio:", pca.explained_variance_ratio_)

# Train machine learning models using X_pca and evaluate their performance
# For example:
# Split the data into train and test sets
# Train your models (e.g., Linear Regression, Random Forest) using X_pca
# Evaluate the models' performance on the test set


##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

I used Principal Component Analysis (PCA), for the following reasons:

Preservation of Variance: PCA identifies the directions (principal components) that maximize the variance in the data. By retaining components that capture the most variance, we can reduce the dimensionality of the dataset while still preserving much of its variability.

Orthogonality: PCA ensures that the principal components (new features) are orthogonal to each other. This orthogonality property simplifies interpretation and reduces multicollinearity among features, which can be beneficial for various machine learning algorithms.

Computational Efficiency: PCA is computationally efficient, making it suitable for large datasets with many features. It accomplishes dimensionality reduction by performing eigenvalue decomposition on the covariance matrix of the standardized data, which can be efficiently computed using techniques like Singular Value Decomposition (SVD).

Simplicity: PCA is conceptually simple and easy to implement. It's a linear transformation technique that doesn't require complex parameter tuning, making it accessible for practitioners and suitable for exploratory data analysis.

Interpretability: While the original features may not be directly interpretable in PCA-transformed space, the principal components themselves represent combinations of the original features. This can provide insights into which features contribute most to the variability in the dataset.
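
The eigen-decomposition view described above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the project's pipeline: the function name `pca_via_covariance`, the random data, and the seed are assumptions, and scikit-learn's PCA computes the same decomposition through SVD rather than through the covariance matrix.

```python
import numpy as np


def pca_via_covariance(X, variance_to_keep=0.95):
    """Project X onto the fewest principal components that retain the requested variance."""
    # Standardize each column to zero mean and unit variance
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # Eigen-decomposition of the symmetric covariance matrix
    cov = np.cov(X_std, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # eigh returns eigenvalues in ascending order; reorder to descending variance
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    ratio = eigenvalues / eigenvalues.sum()
    # Smallest k whose cumulative explained variance reaches the threshold
    k = int(np.searchsorted(np.cumsum(ratio), variance_to_keep) + 1)
    components = eigenvectors[:, :k]
    return X_std @ components, components, ratio


# Synthetic data: four columns, two of which are near-copies of the other two
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 1],
    base[:, 0] + 0.05 * rng.normal(size=200),
    base[:, 1] + 0.05 * rng.normal(size=200),
])

X_pca, components, ratio = pca_via_covariance(X)
print("Components kept:", X_pca.shape[1])
print("Explained variance ratio:", np.round(ratio, 3))
```

Because two of the four columns are almost redundant, most of the variance concentrates in the leading components, which is the situation in which PCA removes dimensions at little cost.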

### 8. Data Splitting

In [None]:
from sklearn.model_selection import train_test_split

# Assuming 'data' contains your dataset
# Separate features (X) and target variable (y)
X = data.drop(columns=['Id', 'SalePrice'])  # 'Id' is an identifier and 'SalePrice' is the target, so neither is a feature
y = data['SalePrice']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Now you have X_train (features for training), X_test (features for testing),
# y_train (target variable for training), and y_test (target variable for testing)


##### What data splitting ratio have you used and why?

I used an 80/20 train/test split (test_size=0.2). This ratio is common practice in machine learning for small to moderate-sized datasets. The majority of the data (80%) is used for training the model, allowing it to learn patterns and relationships within the data. The remaining portion (20%) is reserved for testing the model's performance on unseen data, helping to evaluate its generalization ability and detect overfitting.

However, the choice of the splitting ratio can depend on various factors, including the size of the dataset, the complexity of the problem, and the availability of data.
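
What an 80/20 split does can be sketched with the standard library alone. The helper `split_indices` below is a simplified, hypothetical stand-in for `train_test_split`: it shuffles row indices with a fixed seed and reserves the requested fraction for testing. The row count of 1000 is an arbitrary example.

```python
import random


def split_indices(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices and split them into (train, test) lists."""
    indices = list(range(n_rows))
    # A fixed seed makes the shuffle, and therefore the split, reproducible
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(n_rows=1000)
print(len(train_idx), len(test_idx))
```

Fixing the seed plays the same role as `random_state=42` in the cell above: every run evaluates the model on the same held-out rows, so scores are comparable between experiments.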

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.


Typically, imbalanced datasets refer to classification problems where the distribution of classes is skewed, meaning one class significantly outnumbers the others. This can pose challenges for machine learning models as they may become biased towards the majority class and perform poorly on the minority class. SalePrice, however, is a continuous target in a regression problem, so class imbalance in this sense does not directly apply to this dataset.



In [None]:
from imblearn.over_sampling import RandomOverSampler
import pandas as pd

# Assumes `data` is the DataFrame loaded in the earlier cells.
# Note: oversampling is meant for classification labels; SalePrice is a
# continuous regression target, so this cell is shown for completeness only.

# Separate features and target variable
X = data.drop(columns=['SalePrice'])
y = data['SalePrice']

# Initialize RandomOverSampler (fixed seed for reproducible resampling)
oversampler = RandomOverSampler(random_state=42)

# Perform oversampling
X_resampled, y_resampled = oversampler.fit_resample(X, y)

# Convert X_resampled and y_resampled back to DataFrame if needed
X_resampled_df = pd.DataFrame(X_resampled, columns=X.columns)
y_resampled_df = pd.Series(y_resampled, name='SalePrice')

# Now you can use X_resampled_df and y_resampled_df for training your model


##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

I used the RandomOverSampler technique from the imbalanced-learn library. It balances the class distribution by randomly duplicating minority-class instances until each class is roughly as frequent as the majority class, which helps prevent a classifier from being biased towards the majority class during training. Note that this technique is designed for classification targets. SalePrice is a continuous regression target, so when it is stored as integers each distinct price is treated as its own class, and oversampling it this way is of limited use for this problem.
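
Random oversampling operates on class labels, so the sketch below uses a hypothetical "low"/"high" price band rather than SalePrice itself. The helper `random_oversample` is an illustrative standard-library stand-in for RandomOverSampler, the rows are LotArea values from the sample data, and the labels are invented purely for the demonstration.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        # Rows of this class in the original data are the pool to duplicate from
        pool = [row for row, lab in zip(rows, labels) if lab == label]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(label)
    return out_rows, out_labels


# LotArea values from the sample data with hypothetical price-band labels
rows = [[8450], [9600], [11250], [9550], [14260], [14115], [10084], [10382]]
labels = ["low", "low", "low", "low", "high", "low", "high", "low"]

new_rows, new_labels = random_oversample(rows, labels)
print(Counter(labels), "->", Counter(new_labels))
```

The added rows are exact duplicates, so oversampling should be applied to the training split only; otherwise copies of the same house can end up in both the training and the test set.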



## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

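# Uses X_train, X_test, y_train, y_test from the Data Splitting step.
# RandomForestRegressor needs numeric input without missing values, so the
# features must already be encoded and imputed (see the preprocessing steps above).

# Initialize the Random Forest Regressor model (fixed seed for reproducibility)
model = RandomForestRegressor(random_state=42)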

# Fit the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
import matplotlib.pyplot as plt

# Collect the MSE of every trained model. Only the Random Forest from the
# ML Model - 1 cell exists so far; add an entry for each further model.
mse_values = {
    'Random Forest': mse,
}

# Plotting the evaluation metric score chart
plt.figure(figsize=(10, 6))
plt.bar(list(mse_values.keys()), list(mse_values.values()), color='skyblue')
plt.xlabel('Models')
plt.ylabel('Mean Squared Error (MSE)')
plt.title('Evaluation Metric Score Chart')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()


#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import RandomForestRegressor

# Load and preprocess your dataset
# Assuming you have loaded your dataset into X and y variables

# Split your data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define your model
model = RandomForestRegressor()

# Define the hyperparameters grid for GridSearch CV or RandomSearch CV
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

# Perform GridSearch CV
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)

# Get the best hyperparameters and the best model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Predict on the test set using the best model
y_pred = best_model.predict(X_test)

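# Evaluate the tuned model on the held-out test set
from sklearn.metrics import mean_squared_error, r2_score

print("Best parameters:", best_params)
print("Test RMSE:", mean_squared_error(y_test, y_pred) ** 0.5)
print("Test R^2:", r2_score(y_test, y_pred))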


##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***