## Wrong Data

In [11]:
import pandas as pd
import numpy as np

In [12]:
df = pd.read_csv('stolen_vehicles(copy).csv', header=0)

In [13]:
df.head()

Unnamed: 0,vehicle_id,vehicle_type,make_id,model_year,vehicle_desc,color,date_stolen,location_id
0,1,Trailer,623,2021,BST2021D,Silver,11/5/21,102.0
1,2,Boat Trailer,623,2021,OUTBACK BOATS FT470,Silver,12/13/21,105.0
2,3,Boat Trailer,623,2021,ASD JETSKI,Silver,2/13/22,102.0
3,4,Trailer,623,2021,MSC 7X4,Silver,11/13/21,106.0
4,5,Trailer,623,2018,D-MAX 8X5,Silver,1/10/22,102.0


All 'NaN' values in the data set are wrong data and should instead be zero. One way to fix this is to use numpy to replace the 'NaN' values in a particular column
- by using **df['column_name'].replace(np.nan, 'value to replace the nan value', inplace = True)**
    - "inplace=True" means that the column is modified directly, instead of the result being returned as a new object that we would have to store in a variable
    
- Note: We are not replacing the "NaN" values in the whole DataFrame, only in those columns where it is needed and makes sense (int columns)
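
Before replacing anything, it can help to count how many 'NaN' values each column actually holds, so that only the columns that need it are touched. A minimal sketch, assuming a small made-up DataFrame (the `sample` frame and its values are hypothetical, not taken from the CSV):

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the real data set
sample = pd.DataFrame({
    "model_year": [2021.0, np.nan, 2018.0],
    "location_id": [102.0, np.nan, np.nan],
})

# isna() marks each missing cell as True; sum() counts them per column
nan_counts = sample.isna().sum()
print(nan_counts)
```

Columns with a count of zero can be left out of the replacement step.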

In [14]:
df['location_id'].replace(np.nan, 0, inplace = True)
df['vehicle_type'].replace(np.nan, 0, inplace = True)
df['make_id'].replace(np.nan, 0, inplace = True)
df['model_year'].replace(np.nan, 0, inplace = True)
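
The same replacement can also be written by assigning the result back instead of using "inplace=True". Recent pandas versions warn that calling an inplace method on a selected column may not update the parent DataFrame, so the assignment form is likely the safer choice. A minimal sketch, assuming a small made-up DataFrame (the `sample` frame, its values and the `numeric_cols` list are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for stolen_vehicles(copy).csv
sample = pd.DataFrame({
    "make_id": [623.0, np.nan, 514.0],
    "model_year": [2021.0, 2018.0, np.nan],
    "location_id": [102.0, np.nan, 115.0],
    "color": ["Silver", None, "Black"],
})

# Only the numeric columns where 0 is a sensible placeholder
numeric_cols = ["make_id", "model_year", "location_id"]

# fillna returns a new object; assigning it back updates the DataFrame
sample[numeric_cols] = sample[numeric_cols].fillna(0)

print(sample)
```

The `color` column is left untouched here, in line with the note above about only changing columns where a zero makes sense.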

In [15]:
df.head(15)

Unnamed: 0,vehicle_id,vehicle_type,make_id,model_year,vehicle_desc,color,date_stolen,location_id
0,1,Trailer,623,2021,BST2021D,Silver,11/5/21,102.0
1,2,Boat Trailer,623,2021,OUTBACK BOATS FT470,Silver,12/13/21,105.0
2,3,Boat Trailer,623,2021,ASD JETSKI,Silver,2/13/22,102.0
3,4,Trailer,623,2021,MSC 7X4,Silver,11/13/21,106.0
4,5,Trailer,623,2018,D-MAX 8X5,Silver,1/10/22,102.0
5,6,Roadbike,636,2005,YZF-R6T,Black,12/31/21,102.0
6,7,Trailer,623,2021,CAAR TRANSPORTER,Silver,11/12/21,114.0
7,8,Boat Trailer,623,2001,BOAT,Silver,2/22/22,109.0
8,9,Trailer,514,2021,"7X4-6"" 1000KG",Silver,2/25/22,115.0
9,10,Trailer,514,2020,8X4 TANDEM,Silver,1/3/22,0.0


Let's say that we want to remove integer values that were entered into a column that is supposed to contain only string values:
- we can use *df['ColumnName'] = df['ColumnName'].replace(to_replace=r'\d+', value='', regex=True)*

**Con**
- This method is not appropriate, since it also alters cells that contain both letters and digits: the pattern *r'\d+'* matches every run of digits, wherever it appears. We only wanted to replace the cells consisting strictly of integers.

In [8]:
# Does not work well: digits inside mixed cells are replaced too
#df['vehicle_desc'] = df['vehicle_desc'].replace(to_replace=r'\d+', value='None', regex=True)
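
To see the difference on a small scale, the sketch below compares the unanchored pattern with one anchored by `^` and `$`, which only matches cells made up entirely of digits. The `desc` Series and its values are hypothetical stand-ins for the `vehicle_desc` column, and the anchored pattern is one possible alternative rather than the method used in this notebook:

```python
import pandas as pd

# Hypothetical values: two text descriptions and one digits-only entry
desc = pd.Series(["MSC 7X4", "12345", "TANDEM"])

# Unanchored: every run of digits is replaced, even inside mixed cells
unanchored = desc.replace(to_replace=r"\d+", value="", regex=True)

# Anchored with ^ and $: only cells consisting entirely of digits match
anchored = desc.replace(to_replace=r"^\d+$", value="None", regex=True)

print(unanchored.tolist())
print(anchored.tolist())
```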

The code below solves the problem described in the Con above: it replaces a cell only when its entire content is an integer.

In [16]:
# Define a function to replace cells containing only integers
def replace_integers(cell):
    try:
        # int() fails for mixed text such as 'MSC 7X4'
        int(cell)
        return 'None'  # Replace with your desired value
    except (ValueError, TypeError):
        # Mixed text or NaN (ValueError) and None (TypeError): keep the original value
        return cell

# Apply the function to the specified column
column_name = 'vehicle_desc'  # Replace with the actual column name
df[column_name] = df[column_name].apply(replace_integers)
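
A vectorized alternative to the row-by-row function is to build a boolean mask with `str.fullmatch` and replace only the matching cells. This is a sketch under assumptions: the `desc` Series and its values are hypothetical, and unlike `int()` this pattern only recognizes strings made up purely of digits (no signs, no surrounding spaces, no numeric-typed cells).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the vehicle_desc column
desc = pd.Series(["BST2021D", "2021", np.nan, "8X4 TANDEM"])

# True only where the whole string consists of digits; NaN counts as no match
is_all_digits = desc.str.fullmatch(r"\d+", na=False)

# Replace the flagged cells and leave everything else as it was
cleaned = desc.mask(is_all_digits, "None")

print(cleaned.tolist())
```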


In [17]:
df.head(98)

Unnamed: 0,vehicle_id,vehicle_type,make_id,model_year,vehicle_desc,color,date_stolen,location_id
0,1,Trailer,623,2021,BST2021D,Silver,11/5/21,102.0
1,2,Boat Trailer,623,2021,OUTBACK BOATS FT470,Silver,12/13/21,105.0
2,3,Boat Trailer,623,2021,ASD JETSKI,Silver,2/13/22,102.0
3,4,Trailer,623,2021,MSC 7X4,Silver,11/13/21,106.0
4,5,Trailer,623,2018,D-MAX 8X5,Silver,1/10/22,102.0
...,...,...,...,...,...,...,...,...
93,94,Trailer,623,2018,COMPASS-C85,Silver,3/18/22,102.0
94,95,Trailer,616,2018,,Silver,12/28/21,102.0
95,96,Trailer,527,1985,TANDEM,Silver,1/13/22,102.0
96,97,Roadbike,550,2005,VT,Red,3/19/22,102.0
