# This is an outline of what the capstone project should look like. It can vary depending on your choice of project, dataset, or industry.

Data Analysis Project Guideline
1. Identify a problem to solve

Choose a real-world problem or a hypothetical one
Clearly state the problem and the question you want to answer with your analysis

2. Collect and clean data

Collect data from various sources such as public data repositories or APIs 
Clean and preprocess the data using the techniques you learned in the bootcamp such as handling missing values, removing duplicates, and converting data types.
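
As an illustration of the three cleaning steps named above, here is a minimal sketch on a small made-up table. The column names and values are hypothetical and only serve to show the calls.

```python
import pandas as pd

# Hypothetical raw data: one duplicate row, one missing price, numbers stored as text
raw = pd.DataFrame({
    "region": ["Wales", "Wales", "Scotland", "England"],
    "date": ["2020-01-01", "2020-01-01", "2020-01-01", "2020-01-01"],
    "price": ["150000", "150000", None, "250000"],
})

clean = raw.drop_duplicates()                    # remove exact duplicate rows
clean = clean.dropna(subset=["price"]).copy()    # drop rows with no price
clean["price"] = pd.to_numeric(clean["price"])   # text -> number
clean["date"] = pd.to_datetime(clean["date"])    # text -> datetime
clean = clean.reset_index(drop=True)

print(clean.dtypes)
```

Whether to drop or fill missing values depends on the dataset; dropping is used here only because it is the shortest option.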

3. Perform exploratory data analysis

Use descriptive statistics and visualization techniques to gain insights into the data
Identify trends, patterns, and outliers in the data
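
A short sketch of what this step might look like in pandas, assuming a made-up price series with one extreme value. The 1.5 × IQR rule used here is one common convention for flagging outliers, not the only one.

```python
import pandas as pd

# Hypothetical prices with one obviously extreme value
prices = pd.Series(
    [150_000, 155_000, 160_000, 158_000, 152_000, 900_000], name="price"
)

# Descriptive statistics: count, mean, std, min, quartiles, max
print(prices.describe())

# Flag outliers with the 1.5 * IQR rule
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = prices[(prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)]
print(outliers)
```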

4. Apply statistical analysis (Optional)

Apply statistical techniques to analyze the data and test your hypothesis
Use techniques such as hypothesis testing, regression analysis, or clustering.
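
For example, a correlation coefficient and a simple linear regression can be computed with NumPy alone. The population and price figures below are invented for illustration; a real analysis would use measured data and would need to keep in mind that correlation does not establish causation.

```python
import numpy as np

# Hypothetical yearly figures: population (thousands) and average price (GBP)
population = np.array([500, 510, 525, 540, 560, 575], dtype=float)
avg_price = np.array(
    [181_000, 185_000, 196_000, 203_000, 217_000, 224_000], dtype=float
)

# Pearson correlation coefficient between the two series
r = np.corrcoef(population, avg_price)[0, 1]

# Ordinary least-squares line: avg_price ~ slope * population + intercept
slope, intercept = np.polyfit(population, avg_price, deg=1)

print(f"r = {r:.3f}, slope = {slope:.1f} GBP per thousand people")
```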

5. Visualize the results

Create compelling visualizations to communicate your findings
Use various libraries such as Matplotlib, Seaborn, or Plotly to create visualizations.
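
A minimal Matplotlib sketch of a labelled line chart, using invented yearly prices for two regions. The output file name is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical yearly average prices for two regions
prices = pd.DataFrame(
    {"Wales": [150_000, 156_000, 163_000], "Scotland": [145_000, 149_000, 155_000]},
    index=[2018, 2019, 2020],
)

fig, ax = plt.subplots(figsize=(6, 4))
for region in prices.columns:
    ax.plot(prices.index, prices[region], marker="o", label=region)

# A title, axis labels and a legend make the chart readable on its own
ax.set_title("Average house price by region")
ax.set_xlabel("Year")
ax.set_ylabel("Average price (GBP)")
ax.set_xticks(list(prices.index))
ax.legend()
fig.tight_layout()
fig.savefig("average_price_by_region.png")
```

In a notebook the backend line and `savefig` call can be dropped, since the figure is displayed inline.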

6. Draw conclusions

Summarize your findings and draw conclusions based on your analysis
Answer the question you posed at the beginning of the project and explain how your analysis supports your conclusion

7. Document the process

Document your entire data analysis process including the problem you addressed, data sources, data cleaning and preprocessing, exploratory analysis, statistical analysis, and visualizations
This will help you showcase your skills to potential employers and colleagues.

8. Share the project on GitHub

Share your project on GitHub so that others can learn from your work
This will also serve as a portfolio for you that you can use to showcase your skills to potential employers.

1. Identify a problem to solve
- Why has there been such an increase in housing prices?
- The problem is that factors such as rising production costs (e.g. raw materials), supply and demand, and geopolitical instability have increased the demand for houses, leading owners to raise prices.
- We will look for cost-effective houses at reasonable prices relative to income.

1. What are the current trends in housing prices in our region?
2. How have they evolved over the past decade?
3. What is the median income in our area, and how does it compare to the median home price?
4. What impact has population growth, both natural and through migration, had on the demand for housing?
5. Is there a correlation between population growth and housing price inflation?
6. What is the area code with the most affordable houses?
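
Questions 3 and 6 could be approached along the following lines. This is only a sketch: the rows, the area codes `A1`/`B2`/`C3`, and the `median_income` value are made up, and the column names are assumed to match the housing dataset loaded later.

```python
import pandas as pd

# Made-up sample in the same shape as the housing dataset
prices = pd.DataFrame({
    "Date": pd.to_datetime(["2020-09-01", "2020-10-01"] * 3),
    "Area_Code": ["A1", "A1", "B2", "B2", "C3", "C3"],
    "Average_Price": [200_000, 202_000, 120_000, 121_000, 310_000, 305_000],
})
median_income = 30_000  # hypothetical annual median income (GBP)

# Use the most recent month only, then rank areas from cheapest to most expensive
latest = prices[prices["Date"] == prices["Date"].max()].copy()
latest["Price_to_Income"] = latest["Average_Price"] / median_income
ranked = latest.sort_values("Average_Price").reset_index(drop=True)

most_affordable = ranked.loc[0, "Area_Code"]
print(ranked[["Area_Code", "Average_Price", "Price_to_Income"]])
```

A price-to-income ratio is one simple affordability measure; a fuller analysis would use area-level income rather than a single figure.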

In [None]:
Please have a look at the regular-expression cheat sheet below to understand more



```

.       - Any Character Except New Line
\d      - Digit (0-9)
\D      - Not a Digit (0-9)
\w      - Word Character (a-z, A-Z, 0-9, _)
\W      - Not a Word Character
\s      - Whitespace (space, tab, newline)
\S      - Not Whitespace (space, tab, newline)

\b      - Word Boundary
\B      - Not a Word Boundary
^       - Beginning of a String
$       - End of a String

[]      - Matches Characters in brackets
[^ ]    - Matches Characters NOT in brackets
|       - Either Or
( )     - Group

Quantifiers:
*       - 0 or More
+       - 1 or More
?       - 0 or One
{3}     - Exact Number
{3,4}   - Range of Numbers (Minimum, Maximum)


#### Sample Regexs ####

[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+





```
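
To show how the symbols in the cheat sheet combine, here is a small sketch using Python's built-in `re` module. The date string and the e-mail addresses are made-up inputs.

```python
import re

# ^ and $ anchor the whole string; \d{4} and \d{2} match exact digit counts;
# ( ) groups capture the year, month and day separately
date_pattern = re.compile(r"^(\d{4})-(\d{2})-(\d{2})$")
match = date_pattern.match("2020-10-01")
year = int(match.group(1))

# Sample e-mail pattern from the cheat sheet
email_pattern = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+")
emails = email_pattern.findall(
    "Contact jane.doe@example.com or sales@example.co.uk today"
)

print(year)
print(emails)
```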

In [4]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re

In [5]:
file_path = r"C:\Users\Emmanuel\Downloads\archive (2)\Average-prices-2020-10.csv"
data = pd.read_csv(file_path)

In [6]:
data

Unnamed: 0,Date,Region_Name,Area_Code,Average_Price,Monthly_Change,Annual_Change,Average_Price_SA
0,1968-04-01,Wales,W92000004,2885.414162,0.000000,,
1,1968-04-01,Scotland,S92000003,2844.980688,0.000000,,
2,1968-04-01,Northern Ireland,N92000001,3661.485500,0.000000,,
3,1968-04-01,England,E92000001,3408.108064,0.000000,,
4,1968-04-01,Yorkshire and The Humber,E12000003,2712.015577,0.000000,,
...,...,...,...,...,...,...,...
132030,2020-10-01,Northumberland,E06000057,167042.729600,2.158501,8.206148,
132031,2020-10-01,Bournemouth Christchurch and Poole,E06000058,292394.341300,-0.452157,4.051890,
132032,2020-10-01,England and Wales,K04000001,257321.109800,0.683393,5.385490,254285.4262
132033,2020-10-01,Great Britain,K03000001,248429.816900,0.756134,5.454201,244860.2508
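
The `Unnamed: 0` column is what pandas produces when a header cell in the CSV is empty, which usually means a saved index. Assuming that is the case here, the sketch below shows how `index_col` and `parse_dates` handle it at load time. The first three rows of the output above are inlined so the example runs without the file.

```python
import io
import pandas as pd

# First rows as displayed above, with an empty first header cell (an assumption)
sample_csv = io.StringIO(
    ",Date,Region_Name,Area_Code,Average_Price,Monthly_Change,Annual_Change,Average_Price_SA\n"
    "0,1968-04-01,Wales,W92000004,2885.414162,0.000000,,\n"
    "1,1968-04-01,Scotland,S92000003,2844.980688,0.000000,,\n"
    "2,1968-04-01,Northern Ireland,N92000001,3661.485500,0.000000,,\n"
)

# index_col=0 reuses the saved index instead of creating an "Unnamed: 0" column;
# parse_dates converts Date to datetime64 while reading
sample = pd.read_csv(sample_csv, index_col=0, parse_dates=["Date"])

print(sample.dtypes)
print(sample.isna().sum())  # missing values per column
```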


# Data Cleaning

In [99]:
import pandas as pd
import re

In [100]:
df = data.drop_duplicates()
df

Unnamed: 0,Date,Region_Name,Area_Code,Average_Price,Monthly_Change,Annual_Change,Average_Price_SA
0,1968-04-01,Wales,W92000004,2885.414162,0.000000,,
1,1968-04-01,Scotland,S92000003,2844.980688,0.000000,,
2,1968-04-01,Northern Ireland,N92000001,3661.485500,0.000000,,
3,1968-04-01,England,E92000001,3408.108064,0.000000,,
4,1968-04-01,Yorkshire and The Humber,E12000003,2712.015577,0.000000,,
...,...,...,...,...,...,...,...
132030,2020-10-01,Northumberland,E06000057,167042.729600,2.158501,8.206148,
132031,2020-10-01,Bournemouth Christchurch and Poole,E06000058,292394.341300,-0.452157,4.051890,
132032,2020-10-01,England and Wales,K04000001,257321.109800,0.683393,5.385490,254285.4262
132033,2020-10-01,Great Britain,K03000001,248429.816900,0.756134,5.454201,244860.2508


In [101]:
df = df.drop(columns=["Annual_Change", "Monthly_Change"])
df

Unnamed: 0,Date,Region_Name,Area_Code,Average_Price,Average_Price_SA
0,1968-04-01,Wales,W92000004,2885.414162,
1,1968-04-01,Scotland,S92000003,2844.980688,
2,1968-04-01,Northern Ireland,N92000001,3661.485500,
3,1968-04-01,England,E92000001,3408.108064,
4,1968-04-01,Yorkshire and The Humber,E12000003,2712.015577,
...,...,...,...,...,...
132030,2020-10-01,Northumberland,E06000057,167042.729600,
132031,2020-10-01,Bournemouth Christchurch and Poole,E06000058,292394.341300,
132032,2020-10-01,England and Wales,K04000001,257321.109800,254285.4262
132033,2020-10-01,Great Britain,K03000001,248429.816900,244860.2508
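
If a month-on-month change is needed later, it can be recomputed from `Average_Price`. The sketch below assumes the change is defined as the percentage difference from the previous month within each region. The dataset itself does not state its definition, and the three rows are invented.

```python
import pandas as pd

# Made-up monthly prices for one region
prices = pd.DataFrame({
    "Date": pd.to_datetime(["2020-08-01", "2020-09-01", "2020-10-01"]),
    "Region_Name": ["Wales"] * 3,
    "Average_Price": [100_000.0, 102_000.0, 101_490.0],
})

# Sort so that "previous row" means "previous month" within each region
prices = prices.sort_values(["Region_Name", "Date"])
prices["Monthly_Change"] = (
    prices.groupby("Region_Name")["Average_Price"].pct_change() * 100
)
print(prices)
```

The first month of each region has no predecessor, so its change is `NaN`.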


In [102]:
# Keep Average_Price_SA numeric; missing values stay as NaN
df["Average_Price_SA"] = pd.to_numeric(df["Average_Price_SA"], errors="coerce")
# Parse Date as datetime so the year, month and day are available via .dt
df["Date"] = pd.to_datetime(df["Date"])
df

Unnamed: 0,Date,Region_Name,Area_Code,Average_Price,Average_Price_SA
0,1968-04-01,Wales,W92000004,2885.414162,
1,1968-04-01,Scotland,S92000003,2844.980688,
2,1968-04-01,Northern Ireland,N92000001,3661.485500,
3,1968-04-01,England,E92000001,3408.108064,
4,1968-04-01,Yorkshire and The Humber,E12000003,2712.015577,
...,...,...,...,...,...
132030,2020-10-01,Northumberland,E06000057,167042.729600,
132031,2020-10-01,Bournemouth Christchurch and Poole,E06000058,292394.341300,
132032,2020-10-01,England and Wales,K04000001,257321.109800,254285.4262
132033,2020-10-01,Great Britain,K03000001,248429.816900,244860.2508
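
If only the year is needed (for example to group or filter by year), it can be pulled out of the `Date` column in two ways. The sketch uses two date strings taken from the table above.

```python
import pandas as pd

dates = pd.Series(["1968-04-01", "2020-10-01"], name="Date")

# Option 1: regex capture of the leading four digits (result is text)
year_text = dates.str.extract(r"^(\d{4})", expand=False)

# Option 2: parse to datetime and read the year attribute (result is an integer)
year_int = pd.to_datetime(dates).dt.year

print(year_text.tolist())
print(year_int.tolist())
```

The datetime route is usually the more convenient one, since it also validates the dates and gives access to month and day.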


In [3]:
# Requires `df` from the cleaning cells above (re-run them after a kernel restart)
import pandas as pd
df["Date"] = pd.to_datetime(df["Date"], errors="coerce")

# Define the range of years to drop
start_year = 1968
end_year = 2010

# Drop the rows whose year falls within the specified range
in_range = df["Date"].dt.year.between(start_year, end_year)
df = df[~in_range]
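
Once the data is restricted to the years of interest, the question of how prices evolved over the past decade can be approached with a yearly aggregate. The rows below are invented and only mirror the shape of the cleaned data.

```python
import pandas as pd

# Made-up rows with the same Date and Average_Price columns as the cleaned data
prices = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2010-06-01", "2011-06-01", "2011-12-01", "2020-06-01", "2020-10-01"]
    ),
    "Average_Price": [160_000.0, 165_000.0, 167_000.0, 240_000.0, 248_000.0],
})

# Keep rows from 2011 onwards, then average per calendar year
recent = prices[prices["Date"].dt.year >= 2011]
yearly = recent.groupby(recent["Date"].dt.year)["Average_Price"].mean()

# Total percentage growth between the first and last year in the window
growth_pct = (yearly.iloc[-1] / yearly.iloc[0] - 1) * 100
print(yearly)
print(f"{growth_pct:.1f}% growth")
```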