# Data Cleaning and Data Wrangling

Data wrangling is the process of transforming and structuring data from one raw form into a desired format with the intent of improving data quality and making it more consumable and useful for analytics or machine learning.

Data wrangling prepares your data for the data mining process, which is the stage of analysis when you look for patterns or relationships in your dataset that can guide actionable insights.

Your data analysis can only be as good as the data itself. If you analyze bad data, it's likely that you'll draw ill-informed conclusions and won't be able to make reliable, data-informed decisions.

With wrangled data, you can feel more confident in the conclusions you draw from your data. You'll get results much faster, with less chance of errors or missed opportunities.

In [3]:
import pandas as pd

### Content
- Identifying and selecting the information of interest.
- Non-standard values in categorical columns.
- Cleaning missing values.

In [5]:
# loading the data
housing = pd.read_csv(r"C:\Users\jober\OneDrive\Desktop\Data Science\Data Science - Study notes\Data_used\housing.csv")

In [9]:
pd.set_option('display.max_columns', None)

In [10]:
housing.head(10)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY
5,-122.25,37.85,52.0,919.0,213.0,413.0,193.0,4.0368,269700.0,NEAR BAY
6,-122.25,37.84,52.0,2535.0,489.0,1094.0,514.0,3.6591,299200.0,NEAR BAY
7,-122.25,37.84,52.0,3104.0,687.0,1157.0,647.0,3.12,241400.0,NEAR BAY
8,-122.26,37.84,42.0,2555.0,665.0,1206.0,595.0,2.0804,226700.0,NEAR BAY
9,-122.25,37.84,52.0,3549.0,707.0,1551.0,714.0,3.6912,261100.0,NEAR BAY


### Identifying and selecting the information of interest.

To select the information of interest, we first list the columns in the dataset:

In [8]:
housing.columns

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity'],
      dtype='object')

Then, we keep only the columns we want by passing their names as a list:

In [11]:
housing = housing[['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity']]

In this case, we selected every column, so the dataset is unchanged. If a column is not of interest, just leave it out of the list.
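As a minimal sketch of leaving columns out, the example below uses a small stand-in DataFrame (the name `toy` and the choice of columns to keep are assumptions for illustration, not part of the notebook). It shows two equivalent ways to do it: listing the columns to keep, or dropping the ones that are not needed.

```python
import pandas as pd

# Small stand-in for the housing data; `toy` is a hypothetical name.
toy = pd.DataFrame({
    "longitude": [-122.23, -122.22],
    "latitude": [37.88, 37.86],
    "median_house_value": [452600.0, 358500.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY"],
})

# Option 1: list only the columns to keep.
kept = toy[["median_house_value", "ocean_proximity"]]

# Option 2: name the columns to discard instead.
dropped = toy.drop(columns=["longitude", "latitude"])

print(kept.columns.tolist())
print(dropped.columns.tolist())
```

Listing the columns to keep tends to be easier to read when only a few columns matter, while `drop` is shorter when only a few need to go.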

### Non-standard values in categorical columns.

To find non-standard values, we first identify the data type of each column:

In [12]:
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


In [15]:
# Then, count how often each category appears in the only object column
housing['ocean_proximity'].value_counts()

ocean_proximity
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: count, dtype: int64

This confirms that the `ocean_proximity` column contains only standard values: five distinct categories, with no misspelled or inconsistently cased variants.
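The housing data needs no correction here, but when a categorical column does contain variants of the same label, a common first step is to normalize whitespace and letter case. The sketch below uses made-up labels (the `messy` Series is hypothetical and does not come from the dataset) to show the idea.

```python
import pandas as pd

# Hypothetical messy labels; the housing data above does not contain these.
messy = pd.Series(
    ["NEAR BAY", "near bay ", "Near Bay", "INLAND", " inland"],
    name="ocean_proximity",
)

# Remove surrounding whitespace and use a single letter case.
clean = messy.str.strip().str.upper()

# After normalization only the two real categories remain.
print(clean.value_counts())
```

Running `value_counts()` again after the cleanup is a quick way to check that the variants have collapsed into the expected categories.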

### Cleaning missing values.

The first thing to check is the null values. We count them per column as follows:

In [18]:
housing.isna().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

We can filter the rows with missing values as follows:

In [24]:
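# Boolean mask for rows where total_bedrooms is missing
# (named so it does not shadow Python's built-in filter)
missing_bedrooms = housing['total_bedrooms'].isna()
housing[missing_bedrooms].head(5)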

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
290,-122.16,37.77,47.0,1256.0,,570.0,218.0,4.375,161900.0,NEAR BAY
341,-122.17,37.75,38.0,992.0,,732.0,259.0,1.6196,85100.0,NEAR BAY
538,-122.28,37.78,29.0,5154.0,,3741.0,1273.0,2.5762,173400.0,NEAR BAY
563,-122.24,37.75,45.0,891.0,,384.0,146.0,4.9489,247100.0,NEAR BAY
696,-122.1,37.69,41.0,746.0,,387.0,161.0,3.9063,178400.0,NEAR BAY


In [27]:
# Distribution of total_bedrooms among the "NEAR BAY" rows
near_bay = housing['ocean_proximity'] == 'NEAR BAY'
housing.loc[near_bay, 'total_bedrooms'].value_counts()

total_bedrooms
236.0     11
353.0     11
190.0      9
322.0      9
348.0      9
          ..
786.0      1
737.0      1
1081.0     1
1611.0     1
988.0      1
Name: count, Length: 932, dtype: int64

No single value is representative of the "NEAR BAY" category: the most frequent bedroom counts appear only 11 times each among 932 distinct values.

Only one column, `total_bedrooms`, has missing values. As identified earlier, this column is numerical, so dropping the rows outright is not recommended. They require further analysis.

The `207` rows with missing values represent about `1%` of the `20640` total rows. The analyst's knowledge of the data should decide what to do with them.
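To make the percentage and the possible choices concrete, here is a minimal sketch on a small stand-in DataFrame (the name `toy` and the choice of the median as a fill value are assumptions for illustration; the notebook does not prescribe either option). It computes the share of missing values per column and then shows two common ways of handling them.

```python
import numpy as np
import pandas as pd

# Small stand-in for the housing data; `toy` is a hypothetical name.
toy = pd.DataFrame({
    "total_rooms": [1256.0, 992.0, 880.0, 7099.0],
    "total_bedrooms": [np.nan, np.nan, 129.0, 1106.0],
})

# Share of missing values per column, in percent.
missing_pct = toy.isna().mean() * 100
print(missing_pct)

# Option A: drop the rows where total_bedrooms is missing.
dropped = toy.dropna(subset=["total_bedrooms"])

# Option B: keep the rows and fill the gaps with the column median.
median_bedrooms = toy["total_bedrooms"].median()
filled = toy.assign(
    total_bedrooms=toy["total_bedrooms"].fillna(median_bedrooms)
)
```

On the full dataset the same calculation, `207 / 20640 * 100`, gives roughly 1.0%. Dropping loses the other values in those rows, while filling keeps them at the cost of introducing estimated values, so the choice depends on how the column will be used.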

From our general dataset, we can get the following statistics: