
In [4]:
import pandas as pd

data = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35]
})
data

      name  age
0    Alice   25
1      Bob   30
2  Charlie   35


Using Pandas
For larger datasets, the pandas library simplifies adding an index field. More generally, an index field can be added with a simple loop or with libraries such as pandas and NumPy; the right method depends on the dataset's size and structure.

pandas automatically provides a zero-based row index, but you can add an explicit field for it:

In [3]:
data['index'] = data.index + 1  # Start index from 1 instead of 0
print(data)


      name  age  index
0    Alice   25      1
1      Bob   30      2
2  Charlie   35      3
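
The same 1-based index field can also be built without using the DataFrame's own index, with a plain loop or with NumPy. This is a minimal sketch on the three-row example frame; the column names `index_loop` and `index_np` are illustrative and not part of the original notebook:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35]
})

# Plain-Python loop: adequate for small datasets
index_values = []
for position in range(len(data)):
    index_values.append(position + 1)  # shift from 0-based to 1-based
data["index_loop"] = index_values

# NumPy: builds the whole 1..n range in one vectorized call
data["index_np"] = np.arange(1, len(data) + 1)

print(data)
```

Both columns hold the same values; the NumPy version avoids the Python-level loop, which tends to matter only once the dataset is large.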




2. How to Change Misleading Field Values Using Python
Changing misleading field values ensures data accuracy and consistency. Misleading values may include placeholder values, typos, or inconsistent formatting. These values need to be replaced or corrected to improve data quality.

Common Cases of Misleading Values
Placeholder values: -1, 999, or 0 used to represent missing data.
Typographical errors: mlae instead of male.
Inconsistent representations: Yes vs. Y, No vs. N.
Missing values represented as strings: "N/A", "null".
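
The cases listed above can each be handled with `replace`. The following is a minimal sketch on made-up data; the column names, the placeholder 999, and the mapping dictionaries are assumptions chosen for illustration only:

```python
import numpy as np
import pandas as pd

# Hypothetical data exhibiting each kind of misleading value
df = pd.DataFrame({
    "income": [52000, 999, 61000, 999],
    "gender": ["male", "mlae", "female", "N/A"],
    "smoker": ["Yes", "Y", "No", "N"],
})

# Placeholder value -> NaN (the column becomes float to hold NaN)
df["income"] = df["income"].replace(999, np.nan)

# Typo fix, plus string-encoded missing values -> NaN
df["gender"] = df["gender"].replace({"mlae": "male", "N/A": np.nan, "null": np.nan})

# Inconsistent representations -> one canonical form
df["smoker"] = df["smoker"].replace({"Y": "Yes", "N": "No"})

print(df)
```

Which values count as placeholders depends on the dataset: 0 or 999 may be legitimate measurements in one field and stand-ins for missing data in another, so it is worth checking each field before replacing.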

In [6]:
import pandas as pd

data = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David"],
    "age": [25, -1, 35, -1],
    "gender": ["Female", "Male", "Unknown", "Male"]
})

Replace misleading values:
Replace -1 in the age column with NaN.
Replace Unknown in the gender column with Not Specified.

In [7]:
import numpy as np

# Assign the result back to the column. Calling replace(..., inplace=True) on
# data['age'] acts on an intermediate copy: it raises a FutureWarning in recent
# pandas versions and will not modify data in pandas 3.0.
data['age'] = data['age'].replace(-1, np.nan)
data['gender'] = data['gender'].replace("Unknown", "Not Specified")

print(data)


      name   age         gender
0    Alice  25.0         Female
1      Bob   NaN           Male
2  Charlie  35.0  Not Specified
3    David   NaN           Male


3. How to Re-express Categorical Field Values Using Pandas
Re-expressing categorical fields involves converting textual or symbolic categories into consistent or numerical representations. This is particularly useful for data consistency and preparing data for machine learning models.

In [9]:
import pandas as pd

data = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David"],
    "gender": ["Female", "Male", "Female", "Male"],
    "status": ["Single", "Married", "Single", "Divorced"]
})


Using replace with a Mapping Dictionary
You can map categorical values to numerical values:

In [11]:
gender_map = {"Female": 0, "Male": 1}
status_map = {"Single": 0, "Married": 1, "Divorced": 2}

data['gender'] = data['gender'].replace(gender_map)
data['status'] = data['status'].replace(status_map)

print(data)


      name  gender  status
0    Alice       0       0
1      Bob       1       1
2  Charlie       0       0
3    David       1       2
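
An alternative sketch uses `Series.map`, which is a common choice when every category is being recoded. One behavioral difference is worth knowing: `map` turns any value missing from the dictionary into NaN, whereas `replace` leaves it unchanged. The example re-creates the frame so that it runs on its own:

```python
import pandas as pd

data = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David"],
    "gender": ["Female", "Male", "Female", "Male"],
    "status": ["Single", "Married", "Single", "Divorced"]
})

gender_map = {"Female": 0, "Male": 1}
status_map = {"Single": 0, "Married": 1, "Divorced": 2}

# map() looks up each value in the dictionary; unmapped values become NaN
data["gender"] = data["gender"].map(gender_map)
data["status"] = data["status"].map(status_map)

print(data)
```

Because unmapped categories surface as NaN, `map` can double as a check that the dictionary covers every value in the field.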




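4. How to Standardize Numeric Fields Using Pandas and scikit-learn
Standardization rescales each numeric field to have a mean of 0 and a standard deviation of 1 by subtracting the field's mean and dividing by its standard deviation. This matters for many machine learning models, particularly those sensitive to feature scale, such as distance-based and gradient-based methods.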

In [12]:
data = pd.DataFrame({
    "height": [150, 160, 165, 170, 175],
    "weight": [50, 60, 65, 70, 75]
})


In [13]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

data_standardized = pd.DataFrame(data_scaled, columns=data.columns)
print(data_standardized)


     height    weight
0 -1.627467 -1.627467
1 -0.464991 -0.464991
2  0.116248  0.116248
3  0.697486  0.697486
4  1.278724  1.278724
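
The same result can be reproduced with pandas alone, which makes the formula z = (x - mean) / std explicit. This is a minimal sketch; note that `StandardScaler` uses the population standard deviation, so `ddof=0` is passed here (pandas defaults to `ddof=1`, which would give slightly different numbers):

```python
import pandas as pd

data = pd.DataFrame({
    "height": [150, 160, 165, 170, 175],
    "weight": [50, 60, 65, 70, 75]
})

# Column-wise z-score; ddof=0 matches scikit-learn's StandardScaler
data_standardized = (data - data.mean()) / data.std(ddof=0)

print(data_standardized)
```

The two columns standardize to identical values because weight differs from height by a constant offset of 100, which the mean subtraction removes.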
