### Introduction

Rice is one of the most important staple crops worldwide, particularly in Asia where it serves as the primary food source for billions of people. This analysis focuses on understanding rice cultivation practices through a dataset collected from farmers, examining factors that influence yield and farming efficiency.

The dataset contains comprehensive information about:

Farmer demographics and land characteristics

Soil and drainage conditions

Fertilizer usage (both organic and chemical)

Irrigation practices

Pest and disease management

Harvest outcomes

### Objectives

Data Cleaning and Preparation:

Standardize and clean the raw dataset for analysis

Handle missing values and outliers

Transform variables into consistent formats

### Exploratory Data Analysis:

Understand distribution of key variables

Identify relationships between farming practices and yield

Analyze patterns in fertilizer usage and irrigation

### Practice Analysis:

Compare organic vs chemical fertilizer adoption

Examine timing of transplanting and harvesting

Evaluate disease prevalence and treatment methods

### Yield Optimization:

Identify factors most correlated with high yield

Suggest potential improvements to farming practices



### We start by importing essential Python libraries for data handling and manipulation.

1.pandas for structured data operations.

2.numpy for numerical operations.

3.os for interacting with the operating system and directory structures.


In [24]:
import pandas as pd 
import numpy as np 
import os

## Read in the data

In [26]:
df = pd.read_excel(r"C:\Users\Ashulah\Downloads\data-rice-cultivation\data\Paddy_final data_2023.xlsx")
df

Unnamed: 0,deviceid,start,end,collectionDate,uniqueID,surveyConsent,disctrict,block,FID,q201_LLU,...,residPerc_001,_7_3_,VideoSeen,SelectVideo,SelectVideo/climateChangeAndEffectOnAgri,SelectVideo/smartKisan,SelectVideo/reducingEffects,SelectVideo/earlyPlanting,SelectVideo/correctNuse,coordinates
0,collect:wZycH0QM1S1xarcY,2023-08-07T18:50:22.656+05:30,2023-08-08T07:48:02.663+05:30,2023-08-07,202308080747453,yes,Siwan,Pachrukhi,11048,Kattha,...,,plowed,yes,climateChangeAndEffectOnAgri smartKisan reduci...,1,1,1,1,1,26.1991289 84.4457247 0.0 0.0;26.1990164 84.44...
1,collect:wZycH0QM1S1xarcY,2023-08-08T07:54:07.093+05:30,2023-08-08T08:02:17.188+05:30,2023-08-08,202308080801503,yes,Siwan,Pachrukhi,12569,Kattha,...,,plowed,yes,climateChangeAndEffectOnAgri smartKisan reduci...,1,1,1,1,1,26.1965661 84.4460175 0.0 0.0;26.1965928 84.44...
2,collect:wZycH0QM1S1xarcY,2023-08-08T09:37:47.100+05:30,2023-08-08T09:48:57.548+05:30,2023-08-08,202308080948183,yes,Siwan,Pachrukhi,12485,Kattha,...,,plowed,yes,climateChangeAndEffectOnAgri smartKisan reduci...,1,1,1,1,1,26.1968651 84.4463896 0.0 0.0;26.1969785 84.44...
3,collect:wZycH0QM1S1xarcY,2023-08-08T18:11:56.249+05:30,2023-08-08T18:19:09.514+05:30,2023-08-08,202308081818373,yes,Siwan,Pachrukhi,9649,Kattha,...,,plowed,yes,climateChangeAndEffectOnAgri smartKisan reduci...,1,1,1,1,1,26.1965455 84.4468727 0.0 0.0;26.1965455 84.44...
4,collect:wZycH0QM1S1xarcY,2023-08-04T16:14:12.791+05:30,2023-08-09T11:17:12.543+05:30,2023-08-04,202308091100453,yes,Siwan,Pachrukhi,13111,Kattha,...,,plowed,yes,climateChangeAndEffectOnAgri smartKisan reduci...,1,1,1,1,1,26.1970244 84.4458443 0.0 0.0;26.1971063 84.44...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24064,collect:UJGg6ekeNWyjbgq7,2023-11-02T12:09:30.437+05:30,2023-11-02T12:59:38.548+05:30,2023-11-02,202311021259343,yes,West_Champaran,Majhauliya,31773,Acre,...,,plowed,yes,climateChangeAndEffectOnAgri smartKisan reduci...,1,1,1,1,0,
24065,collect:UJGg6ekeNWyjbgq7,2023-11-02T12:13:32.523+05:30,2023-11-02T12:59:54.009+05:30,2023-11-02,202311021259483,yes,West_Champaran,Majhauliya,31774,Acre,...,,plowed,yes,climateChangeAndEffectOnAgri smartKisan reduci...,1,1,1,0,1,
24066,collect:UJGg6ekeNWyjbgq7,2023-11-02T12:19:15.817+05:30,2023-11-02T13:00:27.500+05:30,2023-11-02,202311021300233,yes,West_Champaran,Majhauliya,31775,Acre,...,,plowed,yes,climateChangeAndEffectOnAgri smartKisan earlyP...,1,1,1,1,0,
24067,collect:UJGg6ekeNWyjbgq7,2023-11-02T12:23:40.544+05:30,2023-11-02T13:00:41.507+05:30,2023-11-02,202311021300363,yes,West_Champaran,Majhauliya,31776,Acre,...,,plowed,yes,climateChangeAndEffectOnAgri smartKisan reduci...,1,1,1,1,0,


### Data Cleaning Process

### Column Selection and Filtering

Drop These Variables :Not Useful for Analysis

| Columns/variables                                                                                    | Reason                                                             |
| ---------------------------------------------------------------------------------------------- | ------------------------------------------------------------------ |
| `deviceid`, `start`, `end`, `collectionDate`                                                   | Metadata from survey collection                                    |
| `surveyConsent`                                                                                | Ethical, not analytical use                                        |
| `q201_LLU`, `q201a_land_dist`, `q201b_land_dist`, `q201c_land_dist`, `Unit_is_q201c_land_dist` | Used only for internal unit conversion (already have area in acre) |
| All fertilizer indicators like `q205_chemFertUsed/Urea` etc.                                   | Covered more clearly by type + quantity fields                     |
| All "Other" fields like `...OtherFert`, `...OtherReason`                                       | Often inconsistent and hard to categorize                          |
| `note_C2ZcvhmJv`, `SelectVideo/climateChangeAndEffectOnAgri` etc.                              | Redundant with `SelectVideo` or `SelectVideo` contains full info   |
| `coordinates`                                                                                  | Use only if you're mapping geo-positions, otherwise drop for EDA   |

This markdown table documents which columns will be removed from the dataset and why, helping maintain transparency in the data cleaning process.


## Keeping Relevant Columns

This code selects only the columns relevant to our analysis from the original dataset. We create a clean copy of the DataFrame with just these columns to work with.



In [30]:

# List of useful columns (keep these)
useful_columns = [
    # Identifiers and location
    'disctrict', 'block', 'FID', 'uniqueID',

    # Land and soil
    'q203_cultLand', 'q204_cropCultLand', 'q206_cropLarestAreaAcre',
    'q207_DrainageClass', 'q208_soilType',

    # Previous crop practices
    'q301_PreviousCrop', 'q302_prevCropTillage', 'q304_SowingTransplantinPCrop',

    # Crop management
    'q101_transmonth', 'q103_transDays',
    'q404_RiceTillageMonth', 'q405_RiceTillageMethod', 'q406_RiceTillageDepth',
    'q301_CropIrrigationApp', 'q301b_IrriTimes',

    # Organic fertilizer
    'q201_orgFert', 'q201a_orgFertWhich/Ganaura', 'q201a_orgFertWhich/FYM',
    'q201a_orgFertWhich/VermiCompost', 'q201a_orgFertWhich/PoultryManure',
    'q202a_orgFertQuant1', 'q202b_orgFertQuant2', 'q202c_orgFertQuant3', 'q202d_orgFertQuant4',

    # Chemical fertilizer usage
    'q203_chemFert', 'q204_chemFertTimes', 'q203aaa_landPrepFert', 'q205_chemFertUsed',
    'q205_chemFertUsed/Urea', 'q205_chemFertUsed/DAP', 'q205_chemFertUsed/NPKS',
    'q205_chemFertUsed/MoP', 'q205_chemFertUsed/SSP', 'q205_chemFertUsed/Zinc',
    'q205b_CropbasalUrea', 'q205c_CropbasalDAP', 'q205a_NPKStype',
    'q205d_CropbasalNPKS', 'q205e_CropbasalMoP', 'q205f_CropbasalSSP', 'q205g_CropbasalZinc',

    # Disease and treatment
    'cropDisease', 'selectDisease', 'selectDisease/leaves_yellowing', 'selectDisease/blast',
    'selectDisease/scorching', 'selectDisease/false_smut', 'selectDisease/others',
    'OrganicCure', 'SelectOrganicCure', 'SelectOrganicCure/brahmastra',
    'SelectOrganicCure/lohastra', 'SelectOrganicCure/jivamrit',
    'SelectOrganicCure/neemastra', 'SelectOrganicCure/mathastra', 'SelectOrganicCure/agniastra',

    # Weeding
    'weedingType', 'weedingMethod',

    # Harvest and yield
    'harvMonth', 'harvWeek', 'harvMethod', 'totalYield', 'harvMoney',

    # Extension video exposure
    'VideoSeen', 'SelectVideo'
]

# Filter only existing columns
valid_columns = [col for col in useful_columns if col in df.columns]
df_cleaned = df[valid_columns].copy()                                


In [31]:
df_cleaned

Unnamed: 0,disctrict,block,FID,uniqueID,q203_cultLand,q204_cropCultLand,q206_cropLarestAreaAcre,q207_DrainageClass,q208_soilType,q301_PreviousCrop,...,SelectOrganicCure/agniastra,weedingType,weedingMethod,harvMonth,harvWeek,harvMethod,totalYield,harvMoney,VideoSeen,SelectVideo
0,Siwan,Pachrukhi,11048,202308080747453,20.0,20.0,0.185,LowLand,blackSoil,Wheat,...,,2.0,byHand,2023-11-01,2,byHand,400.0,900.0,yes,climateChangeAndEffectOnAgri smartKisan reduci...
1,Siwan,Pachrukhi,12569,202308080801503,15.0,15.0,0.148,MediumLand,Clay,Wheat,...,,2.0,byHand,2023-11-01,2,byHand,300.0,600.0,yes,climateChangeAndEffectOnAgri smartKisan reduci...
2,Siwan,Pachrukhi,12485,202308080948183,15.0,15.0,0.148,Upland,SandySoils,Wheat,...,,2.0,byHand,2023-11-01,1,byHand,360.0,600.0,yes,climateChangeAndEffectOnAgri smartKisan reduci...
3,Siwan,Pachrukhi,9649,202308081818373,20.0,20.0,0.185,MediumLand,Clay,Wheat,...,0.0,2.0,byHand,2023-11-01,3,byHand,420.0,900.0,yes,climateChangeAndEffectOnAgri smartKisan reduci...
4,Siwan,Pachrukhi,13111,202308091100453,15.0,15.0,0.185,MediumLand,Clay,Wheat,...,,2.0,byHand,2023-11-01,3,byHand,500.0,1200.0,yes,climateChangeAndEffectOnAgri smartKisan reduci...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24064,West_Champaran,Majhauliya,31773,202311021259343,5.0,5.0,5.000,Upland,SandyLoam,Wheat,...,,2.0,byHand,2023-10-01,2,byHand,580.0,800.0,yes,climateChangeAndEffectOnAgri smartKisan reduci...
24065,West_Champaran,Majhauliya,31774,202311021259483,5.0,5.0,5.000,Upland,SandyLoam,Wheat,...,,2.0,byHand,2023-10-01,2,byHand,600.0,800.0,yes,climateChangeAndEffectOnAgri smartKisan reduci...
24066,West_Champaran,Majhauliya,31775,202311021300233,5.0,5.0,5.000,MediumLand,SandySoils,Wheat,...,,1.0,byHand,2023-10-01,2,byHand,580.0,800.0,yes,climateChangeAndEffectOnAgri smartKisan earlyP...
24067,West_Champaran,Majhauliya,31776,202311021300363,5.0,5.0,5.000,Upland,SandyLoam,Wheat,...,,2.0,byHand,2023-10-01,2,byHand,600.0,1000.0,yes,climateChangeAndEffectOnAgri smartKisan reduci...


## Print the list of column names in the cleaned DataFrame

In [33]:
print(df_cleaned.columns.tolist())


['disctrict', 'block', 'FID', 'uniqueID', 'q203_cultLand', 'q204_cropCultLand', 'q206_cropLarestAreaAcre', 'q207_DrainageClass', 'q208_soilType', 'q301_PreviousCrop', 'q302_prevCropTillage', 'q304_SowingTransplantinPCrop', 'q101_transmonth', 'q103_transDays', 'q404_RiceTillageMonth', 'q405_RiceTillageMethod', 'q406_RiceTillageDepth', 'q301_CropIrrigationApp', 'q301b_IrriTimes', 'q201_orgFert', 'q201a_orgFertWhich/Ganaura', 'q201a_orgFertWhich/FYM', 'q201a_orgFertWhich/VermiCompost', 'q201a_orgFertWhich/PoultryManure', 'q202a_orgFertQuant1', 'q202b_orgFertQuant2', 'q202c_orgFertQuant3', 'q202d_orgFertQuant4', 'q203_chemFert', 'q204_chemFertTimes', 'q203aaa_landPrepFert', 'q205_chemFertUsed', 'q205_chemFertUsed/Urea', 'q205_chemFertUsed/DAP', 'q205_chemFertUsed/NPKS', 'q205_chemFertUsed/MoP', 'q205_chemFertUsed/SSP', 'q205_chemFertUsed/Zinc', 'q205b_CropbasalUrea', 'q205c_CropbasalDAP', 'q205a_NPKStype', 'q205d_CropbasalNPKS', 'q205e_CropbasalMoP', 'q205f_CropbasalSSP', 'q205g_CropbasalZ

## Column Renaming
This code standardizes column names to be more descriptive and consistent, making the dataset easier to work with and understand

Example: 
q208_soilType: soil_type

In [35]:
rename_dict = {
    "uniqueID": "farmer_id",
    "disctrict": "district",
    "block": "block",
    "FID": "record_id",
    "q203_cultLand": "total_cultivable_land_llu",
    "q204_cropCultLand": "land_under_rice_llu",
    "q206_cropLarestAreaAcre": "largest_rice_plot_area_llu",
    "q207_DrainageClass": "drainage_class",
    "q208_soilType": "soil_type",
    "q301_PreviousCrop": "previous_crop",
    "q302_prevCropTillage": "previous_crop_tillage_method",
    "q304_SowingTransplantinPCrop": "previous_crop_transplant_date",
    "q101_transmonth": "transplant_month",
    "q103_transDays": "seedling_age_days",
    "q404_RiceTillageMonth": "rice_tillage_month",
    "q405_RiceTillageMethod": "rice_tillage_method",
    "q406_RiceTillageDepth": "rice_tillage_depth_cm",
    "q301_CropIrrigationApp": "irrigation_applied_flag",
    "q301b_IrriTimes": "irrigation_event_count",
    "q201_orgFert": "organic_fertilizer_used_flag",
    "q201a_orgFertWhich/Ganaura": "organic_ganaura_used",
    "q201a_orgFertWhich/FYM": "organic_fym_used",
    "q201a_orgFertWhich/VermiCompost": "organic_vermicompost_used",
    "q201a_orgFertWhich/PoultryManure": "organic_poultry_manure_used",
    "q202a_orgFertQuant1": "organic_fert_qty_type1",
    "q202b_orgFertQuant2": "organic_fert_qty_type2",
    "q202c_orgFertQuant3": "organic_fert_qty_type3",
    "q202d_orgFertQuant4": "organic_fert_qty_type4",
    "q203_chemFert": "chemical_fertilizer_used_flag",
    "q204_chemFertTimes": "chemical_fertilizer_application_count",
    "q203aaa_landPrepFert": "fertilizer_used_during_land_preparation",
    "q205_chemFertUsed": "chemical_fertilizers_used_list",
    "q205_chemFertUsed/Urea": "chem_fert_urea_used_flag",
    "q205_chemFertUsed/DAP": "chem_fert_dap_used_flag",
    "q205_chemFertUsed/NPKS": "chem_fert_npks_used_flag",
    "q205_chemFertUsed/MoP": "chem_fert_mop_used_flag",
    "q205_chemFertUsed/SSP": "chem_fert_ssp_used_flag",
    "q205_chemFertUsed/Zinc": "chem_fert_zinc_used_flag",
    "q205b_CropbasalUrea": "basal_urea_kg",
    "q205c_CropbasalDAP": "basal_dap_kg",
    "q205a_NPKStype": "npks_type",
    "q205d_CropbasalNPKS": "basal_npks_kg",
    "q205e_CropbasalMoP": "basal_mop_kg",
    "q205f_CropbasalSSP": "basal_ssp_kg",
    "q205g_CropbasalZinc": "basal_zinc_kg",
    "cropDisease": "disease_observed_flag",
    "selectDisease": "disease_type",
    "selectDisease/leaves_yellowing": "disease_leaves_yellowing",
    "selectDisease/blast": "disease_blast",
    "selectDisease/scorching": "disease_scorching",
    "selectDisease/false_smut": "disease_false_smut",
    "selectDisease/others": "disease_others",
    "OrganicCure": "organic_pesticide_used_flag",
    "SelectOrganicCure": "organic_pesticide_type",
    "SelectOrganicCure/brahmastra": "used_brahmastra_flag",
    "SelectOrganicCure/lohastra": "used_lohastra_flag",
    "SelectOrganicCure/jivamrit": "used_jivamrit_flag",
    "SelectOrganicCure/neemastra": "used_neemastra_flag",
    "SelectOrganicCure/mathastra": "used_mathastra_flag",
    "SelectOrganicCure/agniastra": "used_agniastra_flag",
    "weedingType": "weeding_type",
    "weedingMethod": "weeding_method",
    "harvMonth": "harvest_month",
    "harvWeek": "harvest_week",
    "harvMethod": "harvest_method",
    "totalYield": "yield_kg",
    "harvMoney": "harvest_income_inr",
    "VideoSeen": "video_seen_flag",
    "SelectVideo": "video_topic_seen"
}
df_cleaned.rename(columns=rename_dict, inplace=True)


This is the list of new column that we have after renaming 

In [37]:
print(df_cleaned.columns.tolist())

['district', 'block', 'record_id', 'farmer_id', 'total_cultivable_land_llu', 'land_under_rice_llu', 'largest_rice_plot_area_llu', 'drainage_class', 'soil_type', 'previous_crop', 'previous_crop_tillage_method', 'previous_crop_transplant_date', 'transplant_month', 'seedling_age_days', 'rice_tillage_month', 'rice_tillage_method', 'rice_tillage_depth_cm', 'irrigation_applied_flag', 'irrigation_event_count', 'organic_fertilizer_used_flag', 'organic_ganaura_used', 'organic_fym_used', 'organic_vermicompost_used', 'organic_poultry_manure_used', 'organic_fert_qty_type1', 'organic_fert_qty_type2', 'organic_fert_qty_type3', 'organic_fert_qty_type4', 'chemical_fertilizer_used_flag', 'chemical_fertilizer_application_count', 'fertilizer_used_during_land_preparation', 'chemical_fertilizers_used_list', 'chem_fert_urea_used_flag', 'chem_fert_dap_used_flag', 'chem_fert_npks_used_flag', 'chem_fert_mop_used_flag', 'chem_fert_ssp_used_flag', 'chem_fert_zinc_used_flag', 'basal_urea_kg', 'basal_dap_kg', 'npk

## All variable That we will be dealing with
 #### Farmer Identity & Location
| Column      | Description                                    |
| ----------- | ---------------------------------------------- |
| `district`  | Name of the district where the farmer resides  |
| `block`     | Subdivision/block within the district          |
| `record_id` | Internal system record identifier              |
| `farmer_id` | Unique ID assigned to each farmer (anonymized) |
### Land Ownership & Rice Area (in LLU:local Land Units)
| Column                       | Description                              |
| ---------------------------- | ---------------------------------------- |
| `total_cultivable_land_llu`  | Total land accessible (owned + leased)   |
| `land_under_rice_llu`        | Portion of land under rice cultivation   |
| `largest_rice_plot_area_llu` | Area of the largest individual rice plot |
### Land Ownership & Rice Area (in LLU : Local Land Units)

| Column                       | Description                              |
| ---------------------------- | ---------------------------------------- |
| `total_cultivable_land_llu`  | Total land accessible (owned + leased)   |
| `land_under_rice_llu`        | Portion of land under rice cultivation   |
| `largest_rice_plot_area_llu` | Area of the largest individual rice plot |
 ### Soil & Drainage Characteristics
| Column           | Description                                       |
| ---------------- | ------------------------------------------------- |
| `drainage_class` | Drainage type of the plot (e.g., lowland, upland) |
| `soil_type`      | Soil category (e.g., loamy, clayey, sandy)        |
### Previous Crop & Land Preparation
| Column                          | Description                                |
| ------------------------------- | ------------------------------------------ |
| `previous_crop`                 | Crop grown prior to current rice cycle     |
| `previous_crop_tillage_method`  | Tillage method used for the previous crop  |
| `previous_crop_transplant_date` | Sowing or transplant date of previous crop |

### Previous Crop & Land Preparation

| Column                          | Description                                |
| ------------------------------- | ------------------------------------------ |
| `previous_crop`                 | Crop grown prior to current rice cycle     |
| `previous_crop_tillage_method`  | Tillage method used for the previous crop  |
| `previous_crop_transplant_date` | Sowing or transplant date of previous crop |

### Transplanting & Rice Tillage
| Column                  | Description                                 |
| ----------------------- | ------------------------------------------- |
| `transplant_month`      | Month of rice transplanting                 |
| `seedling_age_days`     | Age of seedlings at transplant (in days)    |
| `rice_tillage_month`    | Month when land was tilled for rice         |
| `rice_tillage_method`   | Tillage method used before rice cultivation |
| `rice_tillage_depth_cm` | Depth of tillage (in cm)                    |
### Organic Fertilizer Use
| Column                         | Description                            |
| ------------------------------ | -------------------------------------- |
| `organic_fertilizer_used_flag` | Whether organic fertilizer was applied |
| `organic_ganaura_used`         | Use of Ganaura (1 = Yes, 0 = No)       |
| `organic_fym_used`             | Use of FYM (Farmyard Manure)           |
| `organic_vermicompost_used`    | Use of Vermicompost                    |
| `organic_poultry_manure_used`  | Use of Poultry Manure                  |
| `organic_fert_qty_type1`       | Quantity of organic fertilizer type 1  |
| `organic_fert_qty_type2`       | Quantity of organic fertilizer type 2  |
| `organic_fert_qty_type3`       | Quantity of organic fertilizer type 3  |
| `organic_fert_qty_type4`       | Quantity of organic fertilizer type 4  |
### Chemical Fertilizer Use 
| Column                                    | Description                                |
| ----------------------------------------- | ------------------------------------------ |
| `chemical_fertilizer_used_flag`           | Whether chemical fertilizers were used     |
| `chemical_fertilizer_application_count`   | Number of chemical fertilizer applications |
| `fertilizer_used_during_land_preparation` | Use of fertilizer during land preparation  |
| `chemical_fertilizers_used_list`          | List of chemical fertilizers used          |
| `chem_fert_urea_used_flag`                | Use of Urea                                |
| `chem_fert_dap_used_flag`                 | Use of DAP                                 |
| `chem_fert_npks_used_flag`                | Use of NPKS                                |
| `chem_fert_mop_used_flag`                 | Use of MOP                                 |
| `chem_fert_ssp_used_flag`                 | Use of SSP                                 |
| `chem_fert_zinc_used_flag`                | Use of Zinc                                |
### Basal (Early) Fertilizer Application 
| Column          | Description                                           |
| --------------- | ----------------------------------------------------- |
| `basal_urea_kg` | Urea applied during basal application (kg)            |
| `basal_dap_kg`  | DAP applied during basal application (kg)             |
| `npks_type`     | Type of NPKS fertilizer used                          |
| `basal_npks_kg` | NPKS fertilizer applied during basal application (kg) |
| `basal_mop_kg`  | MOP applied during basal application (kg)             |
| `basal_ssp_kg`  | SSP applied during basal application (kg)             |
| `basal_zinc_kg` | Zinc applied during basal application (kg)            |
### Crop Disease & Organic Remedies 
| Column                        | Description                              |
| ----------------------------- | ---------------------------------------- |
| `disease_observed_flag`       | Whether disease was observed in the crop |
| `disease_type`                | Text description of observed disease     |
| `disease_leaves_yellowing`    | Specific disease type: Leaves yellowing  |
| `disease_blast`               | Specific disease type: Blast             |
| `disease_scorching`           | Specific disease type: Scorching         |
| `disease_false_smut`          | Specific disease type: False smut        |
| `disease_others`              | Specific disease type: Other             |
| `organic_pesticide_used_flag` | Whether organic remedies were used       |
| `organic_pesticide_type`      | Type of organic pesticide/remedy used    |
| `used_brahmastra_flag`        | Whether Brahmastra was used              |
| `used_lohastra_flag`          | Whether Lohastra was used                |
| `used_jivamrit_flag`          | Whether Jivamrit was used                |
| `used_neemastra_flag`         | Whether Neemastra was used               |
| `used_mathastra_flag`         | Whether Mathastra was used               |
| `used_agniastra_flag`         | Whether Agniastra was used               |
### Crop Disease & Organic Remedies(treatment medecine)
| Column                        | Description                              |
| ----------------------------- | ---------------------------------------- |
| `disease_observed_flag`       | Whether disease was observed in the crop |
| `disease_type`                | Text description of observed disease     |
| `disease_leaves_yellowing`    | Specific disease type: Leaves yellowing  |
| `disease_blast`               | Specific disease type: Blast             |
| `disease_scorching`           | Specific disease type: Scorching         |
| `disease_false_smut`          | Specific disease type: False smut        |
| `disease_others`              | Specific disease type: Other             |
| `organic_pesticide_used_flag` | Whether organic remedies were used       |
| `organic_pesticide_type`      | Type of organic pesticide/remedy used    |
| `used_brahmastra_flag`        | Whether Brahmastra was used              |
| `used_lohastra_flag`          | Whether Lohastra was used                |
| `used_jivamrit_flag`          | Whether Jivamrit was used                |
| `used_neemastra_flag`         | Whether Neemastra was used               |
| `used_mathastra_flag`         | Whether Mathastra was used               |
| `used_agniastra_flag`         | Whether Agniastra was used               |
### Weeding 
| Column           | Description                                 |
| ---------------- | ------------------------------------------- |
| `weeding_type`   | Type of weeding practiced                   |
| `weeding_method` | Method used for weeding (manual/mechanical) |
### Harvest & Post-Harvest 
| Column               | Description                                       |
| -------------------- | ------------------------------------------------- |
| `harvest_month`      | Month when harvest took place                     |
| `harvest_week`       | Week when harvest occurred                        |
| `harvest_method`     | Method used for harvesting (Manual or Mechanical) |
| `yield_kg`           | Total yield in kilograms                          |
| `harvest_income_inr` | Income from harvest in Indian Rupees (INR)        |
###  Extension / Advisory Exposure 
| Column             | Description                         |
| ------------------ | ----------------------------------- |
| `video_seen_flag`  | Whether any advisory video was seen |
| `video_topic_seen` | Topics or titles of the videos seen |


## Handling Missing Values


In [40]:
df_cleaned.columns

Index(['district', 'block', 'record_id', 'farmer_id',
       'total_cultivable_land_llu', 'land_under_rice_llu',
       'largest_rice_plot_area_llu', 'drainage_class', 'soil_type',
       'previous_crop', 'previous_crop_tillage_method',
       'previous_crop_transplant_date', 'transplant_month',
       'seedling_age_days', 'rice_tillage_month', 'rice_tillage_method',
       'rice_tillage_depth_cm', 'irrigation_applied_flag',
       'irrigation_event_count', 'organic_fertilizer_used_flag',
       'organic_ganaura_used', 'organic_fym_used', 'organic_vermicompost_used',
       'organic_poultry_manure_used', 'organic_fert_qty_type1',
       'organic_fert_qty_type2', 'organic_fert_qty_type3',
       'organic_fert_qty_type4', 'chemical_fertilizer_used_flag',
       'chemical_fertilizer_application_count',
       'fertilizer_used_during_land_preparation',
       'chemical_fertilizers_used_list', 'chem_fert_urea_used_flag',
       'chem_fert_dap_used_flag', 'chem_fert_npks_used_flag',
       '

In [41]:
df_cleaned.isna().sum()

district                       0
block                          0
record_id                      0
farmer_id                      0
total_cultivable_land_llu     26
                            ... 
harvest_method                 0
yield_kg                       0
harvest_income_inr           524
video_seen_flag                0
video_topic_seen               0
Length: 69, dtype: int64

this code checks the missing values that we have in our dataframe 

 #### Show only columns with missing (NaN) values

In [44]:
missing_values = df_cleaned.isna().sum()
missing_with_values = missing_values[missing_values > 0]

# Display
print(missing_with_values)

total_cultivable_land_llu                     26
land_under_rice_llu                           26
largest_rice_plot_area_llu                    26
drainage_class                                26
soil_type                                     26
previous_crop                                 26
previous_crop_tillage_method                7194
previous_crop_transplant_date               7194
transplant_month                             112
seedling_age_days                            112
rice_tillage_month                            26
rice_tillage_method                           26
rice_tillage_depth_cm                         26
irrigation_event_count                      1756
organic_ganaura_used                        8243
organic_fym_used                            8243
organic_vermicompost_used                   8243
organic_poultry_manure_used                 8243
organic_fert_qty_type1                     20104
organic_fert_qty_type2                     12264
organic_fert_qty_typ

##  Handling Missing Values Based on Percentage of Null Rows

Before handling missing values, it is important to understand **how much data is missing** in each column. The following code calculates the percentage of missing values in each column and displays only the ones that have missing data:

This helps in deciding the appropriate strategy for handling missing data:

- Drop columns with too much missing data (e.g., > 50%)
- Impute missing values using mean, median, mode, or domain knowledge


In [46]:
total_rows = len(df_cleaned)

# Calculate % of missing values
missing_percent = df_cleaned.isna().sum() / total_rows * 100

# Display only columns with missing data
missing_percent[missing_percent > 0].sort_values(ascending=False)


basal_ssp_kg                               99.439113
organic_fert_qty_type4                     99.011176
npks_type                                  97.345133
basal_npks_kg                              97.299431
organic_fert_qty_type3                     96.318916
basal_mop_kg                               91.931530
basal_zinc_kg                              90.905314
used_brahmastra_flag                       87.789273
used_agniastra_flag                        87.789273
used_mathastra_flag                        87.789273
used_neemastra_flag                        87.789273
used_jivamrit_flag                         87.789273
used_lohastra_flag                         87.789273
organic_pesticide_type                     87.789273
organic_fert_qty_type1                     83.526528
organic_fert_qty_type2                     50.953509
disease_scorching                          45.041339
disease_blast                              45.041339
disease_leaves_yellowing                   45.

## Handle missing using different imputation methodes

- This code identifies columns with missing values, calculates what percentage of values are missing
  and then imputes missing values using median for numerical columns and mode for categorical columns.


In [48]:
# 1 Handle missing values of previous crop
df_cleaned['previous_crop'].fillna("not specified")

0        Wheat
1        Wheat
2        Wheat
3        Wheat
4        Wheat
         ...  
24064    Wheat
24065    Wheat
24066    Wheat
24067    Wheat
24068    Wheat
Name: previous_crop, Length: 24069, dtype: object

In [49]:
# 2
df_cleaned['total_cultivable_land_llu'].fillna(df_cleaned['total_cultivable_land_llu'].median())

0        20.0
1        15.0
2        15.0
3        20.0
4        15.0
         ... 
24064     5.0
24065     5.0
24066     5.0
24067     5.0
24068     5.0
Name: total_cultivable_land_llu, Length: 24069, dtype: float64

In [50]:
#3
df_cleaned['soil_type'].fillna("not specified")
df_cleaned['drainage_class'].fillna("not specified")

0           LowLand
1        MediumLand
2            Upland
3        MediumLand
4        MediumLand
            ...    
24064        Upland
24065        Upland
24066    MediumLand
24067        Upland
24068        Upland
Name: drainage_class, Length: 24069, dtype: object

In [51]:
#4
checkbox_cols = [col for col in df_cleaned.columns if '_flag' in col or 'used_' in col]
df_cleaned[checkbox_cols] = df_cleaned[checkbox_cols].fillna(0)


In [52]:
df_cleaned['chemical_fertilizers_used_list'].unique()

array(['Urea DAP', 0, 'Urea', 'DAP MoP', 'DAP Urea', 'DAP NPKS',
       'Urea DAP Zinc', 'Urea DAP Zinc MoP', 'Urea DAP MoP Zinc', 'DAP',
       'DAP NPKS Urea', 'NPKS DAP', 'Urea NPKS', 'DAP MoP Zinc', 'Zinc',
       'DAP NPKS Zinc', 'Urea SSP', 'Urea DAP MoP', 'DAP Zinc',
       'Zinc DAP', 'DAP Zinc MoP', 'Urea DAP NPKS', 'NPKS Urea',
       'Urea Other', 'Urea Other MoP', 'NPKS', 'MoP', 'MoP DAP',
       'Zinc DAP MoP', 'MoP DAP Zinc', 'DAP SSP Zinc', 'Urea NPKS Zinc',
       'DAP MoP Urea', 'MoP Zinc DAP', 'DAP Urea MoP', 'Urea NPKS DAP',
       'Urea MoP DAP', 'Urea MoP', 'DAP Other', 'MoP DAP Urea',
       'Other DAP', 'Urea DAP Other', 'Urea DAP SSP', 'Urea Zinc', 'SSP',
       'Urea NPKS SSP', 'MoP Urea', 'MoP Urea DAP', 'DAP Urea Zinc',
       'Urea NPKS MoP', 'Urea NPKS DAP Zinc', 'Urea Zinc DAP',
       'Urea DAP NPKS Zinc', 'DAP Urea Other', 'Other', 'Urea Other DAP',
       'Zinc Urea DAP', 'DAP Zinc Urea', 'Zinc MoP',
       'Urea DAP Zinc Other', 'DAP MoP Zinc Urea', 'D

In [53]:
#5
df_cleaned['chemical_fertilizers_used_list'].fillna("None")

0             Urea DAP
1             Urea DAP
2             Urea DAP
3             Urea DAP
4             Urea DAP
             ...      
24064    Urea DAP Zinc
24065    Urea DAP Zinc
24066    Urea DAP Zinc
24067    Urea DAP Zinc
24068    Urea DAP Zinc
Name: chemical_fertilizers_used_list, Length: 24069, dtype: object

In [54]:
df_cleaned['disease_type'].fillna("None")


0        None
1        None
2        None
3        None
4        None
         ... 
24064    None
24065    None
24066    None
24067    None
24068    None
Name: disease_type, Length: 24069, dtype: object

In [55]:
# Fill numeric fertilizer quantities with 0 where missing
fertilizer_qty_cols = [
    'organic_fert_qty_type3', 'organic_fert_qty_type4',
    'basal_npks_kg', 'basal_mop_kg', 'basal_ssp_kg', 'basal_zinc_kg'
]

for col in fertilizer_qty_cols:
    df_cleaned[f'{col}_missing'] = df_cleaned[col].isnull().astype(int)  # create flag
    df_cleaned[col] = df_cleaned[col].fillna(0)
df_cleaned['npks_type'] = df_cleaned['npks_type'].fillna('None')

In [56]:
df_cleaned['previous_crop_transplant_date'] = pd.to_datetime(df_cleaned['previous_crop_transplant_date'], errors='coerce')
most_common_month = df_cleaned['previous_crop_transplant_date'].mode()[0]
df_cleaned['transplant_month_missing'] = df_cleaned['previous_crop_transplant_date'].isnull().astype(int)
df_cleaned['previous_crop_transplant_date'].fillna(most_common_month)

0       2023-04-01
1       2023-04-01
2       2023-04-01
3       2023-04-01
4       2023-04-01
           ...    
24064   2023-07-01
24065   2023-07-01
24066   2023-07-01
24067   2023-07-01
24068   2023-04-01
Name: previous_crop_transplant_date, Length: 24069, dtype: datetime64[ns]

In [57]:
df_cleaned['irrigation_event_count'].fillna(df_cleaned['irrigation_event_count'].median())

0        2.0
1        2.0
2        2.0
3        2.0
4        2.0
        ... 
24064    1.0
24065    1.0
24066    1.0
24067    1.0
24068    1.0
Name: irrigation_event_count, Length: 24069, dtype: float64

In [58]:
df_cleaned['total_cultivable_land_llu_missing'] = df_cleaned['total_cultivable_land_llu'].isnull().astype(int)
df_cleaned['total_cultivable_land_llu'].fillna(df_cleaned['total_cultivable_land_llu'].median())

df_cleaned['largest_rice_plot_area_llu_missing'] = df_cleaned['largest_rice_plot_area_llu'].isnull().astype(int)
df_cleaned['largest_rice_plot_area_llu'].fillna(df_cleaned['largest_rice_plot_area_llu'].median())


0        0.185
1        0.148
2        0.148
3        0.185
4        0.185
         ...  
24064    5.000
24065    5.000
24066    5.000
24067    5.000
24068    5.000
Name: largest_rice_plot_area_llu, Length: 24069, dtype: float64

In [59]:
import pandas as pd

# ---- Fill missing dates with median (or mode) ----
date_cols = ['previous_crop_transplant_date', 'rice_tillage_month']
for col in date_cols:
    if col in df_cleaned.columns:
        df_cleaned[col] = pd.to_datetime(df_cleaned[col], errors='coerce')
        df_cleaned[f'{col}_missing'] = df_cleaned[col].isnull().astype(int)
        if not df_cleaned[col].dropna().empty:
            df_cleaned[col].fillna(df_cleaned[col].median())

# ---- Categorical columns ----
cat_cols = ['previous_crop_tillage_method', 'rice_tillage_method']
for col in cat_cols:
    if col in df_cleaned.columns:
        df_cleaned[f'{col}_missing'] = df_cleaned[col].isnull().astype(int)
        mode_val = df_cleaned[col].mode()[0] if not df_cleaned[col].dropna().empty else "Not specified"
        df_cleaned[col].fillna(mode_val)

# ---- Numeric columns (median fill + missing flag) ----
num_cols = [
    'irrigation_event_count',
    'chemical_fertilizer_application_count',
    'harvest_income_inr',
    'seedling_age_days',
    'land_under_rice_llu',
    'rice_tillage_depth_cm'
]

for col in num_cols:
    if col in df_cleaned.columns:
        df_cleaned[f'{col}_missing'] = df_cleaned[col].isnull().astype(int)
        df_cleaned[col].fillna(df_cleaned[col].median())


In [60]:
# ---- Binary Flag Columns (0/1): Fill with 0 ----
binary_flags = [
    'disease_leaves_yellowing', 'disease_blast', 'disease_scorching',
    'disease_false_smut', 'disease_others',
    'organic_ganaura_used', 'organic_fym_used',
    'organic_vermicompost_used', 'organic_poultry_manure_used'
]

df_cleaned[binary_flags] = df_cleaned[binary_flags].fillna(0)

# ---- Organic Fertilizer Quantities: Fill with 0, track missing ----
organic_qty_cols = ['organic_fert_qty_type1', 'organic_fert_qty_type2']
for col in organic_qty_cols:
    df_cleaned[f'{col}_missing'] = df_cleaned[col].isnull().astype(int)
    df_cleaned[col] = df_cleaned[col].fillna(0)

# ---- Categorical (Text): pesticide, disease_type ----
df_cleaned['organic_pesticide_type'] = df_cleaned['organic_pesticide_type'].fillna("None")
df_cleaned['organic_pesticide_type_missing'] = df_cleaned['organic_pesticide_type'].eq("None").astype(int)

df_cleaned['disease_type'] = df_cleaned['disease_type'].fillna("None")
df_cleaned['disease_type_missing'] = df_cleaned['disease_type'].eq("None").astype(int)

# ---- Land Area Columns: Fill with median ----
land_cols = ['total_cultivable_land_llu', 'largest_rice_plot_area_llu']
for col in land_cols:
    df_cleaned[f'{col}_missing'] = df_cleaned[col].isnull().astype(int)
    df_cleaned[col] = df_cleaned[col].fillna(df_cleaned[col].median())


In [61]:
import pandas as pd

# STEP 1: Identify all missing flags
missing_flag_cols = [col for col in df_cleaned.columns if col.endswith('_missing')]

# STEP 2: Choose the target outcome (yield column)
# Replace with your actual yield column name
target_col = 'yield_kg'

# STEP 3: Check correlation between each flag and yield
correlated_flags = []

for col in missing_flag_cols:
    if df_cleaned[col].dtype in ['int64', 'int32', 'float64']:
        corr = df_cleaned[[col, target_col]].dropna().corr().iloc[0, 1]
        if abs(corr) >= 0.05:  # threshold — adjust as needed
            correlated_flags.append(col)
            print(f"{col}: correlation with {target_col} = {corr:.3f}")

# STEP 4: Drop uncorrelated missing flags
flags_to_drop = list(set(missing_flag_cols) - set(correlated_flags))
df_cleaned = df_cleaned.drop(columns=flags_to_drop)

print(f"\n Kept {len(correlated_flags)} _missing flags that correlate with yield.")
df_cleaned = df_cleaned.drop(columns=['transplant_month'], errors='ignore')
df_cleaned = df_cleaned.drop(columns=['transplant_month_missing'], errors='ignore')

organic_fert_qty_type3_missing: correlation with yield_kg = -0.115
basal_npks_kg_missing: correlation with yield_kg = -0.059
transplant_month_missing: correlation with yield_kg = -0.108
previous_crop_transplant_date_missing: correlation with yield_kg = -0.108
previous_crop_tillage_method_missing: correlation with yield_kg = -0.108
harvest_income_inr_missing: correlation with yield_kg = 0.052

 Kept 6 _missing flags that correlate with yield.


In [62]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24069 entries, 0 to 24068
Data columns (total 73 columns):
 #   Column                                   Non-Null Count  Dtype         
---  ------                                   --------------  -----         
 0   district                                 24069 non-null  object        
 1   block                                    24069 non-null  object        
 2   record_id                                24069 non-null  int64         
 3   farmer_id                                24069 non-null  int64         
 4   total_cultivable_land_llu                24069 non-null  float64       
 5   land_under_rice_llu                      24043 non-null  float64       
 6   largest_rice_plot_area_llu               24069 non-null  float64       
 7   drainage_class                           24043 non-null  object        
 8   soil_type                                24043 non-null  object        
 9   previous_crop                          

In [63]:
df_cleaned.isna().sum()


district                                 0
block                                    0
record_id                                0
farmer_id                                0
total_cultivable_land_llu                0
                                        ..
organic_fert_qty_type3_missing           0
basal_npks_kg_missing                    0
previous_crop_transplant_date_missing    0
previous_crop_tillage_method_missing     0
harvest_income_inr_missing               0
Length: 73, dtype: int64

### Checking for Duplicate Rows

To ensure data quality, it's important to check for any duplicated rows in the cleaned dataset:

**Result:** `0`

This means there are **no duplicated rows** in the dataset, which confirms that all records are unique — a good sign of data integrity.


In [65]:
df_cleaned.duplicated().sum()

0

In [66]:
df_cleaned['previous_crop_tillage_method'] = df_cleaned['previous_crop_tillage_method'].fillna("Not Specified")
# Ensure the column is datetime and just leave missing as NaT (Not a Time)
df_cleaned['previous_crop_transplant_date'] = pd.to_datetime(df_cleaned['previous_crop_transplant_date'], errors='coerce')


### Changing Standard of values and Data formatting
Standardizing missing/null values and binary fields is indeed a good practice that will help us in several ways:

Consistency - Ensures all missing values are treated the same way

Accuracy - Prevents analysis errors from mixed representations

In [68]:
df_cleaned['chemical_fertilizers_used_list'] = df_cleaned['chemical_fertilizers_used_list'].replace(0, None)
# Force conversion to integers (will raise error if not possible)
df_cleaned['farmer_id'] = pd.to_numeric(df_cleaned['farmer_id']).astype('Int64')

In [69]:
#Consider creating derived datetime features like:
df_cleaned['growth_duration'] = (df_cleaned['harvest_month'] - df_cleaned['rice_tillage_month']).dt.days
#For categorical data (like soil_type), we consider:
df_cleaned['soil_type'].str.lower().str.replace(' ', '').str.strip()
df_cleaned

Unnamed: 0,district,block,record_id,farmer_id,total_cultivable_land_llu,land_under_rice_llu,largest_rice_plot_area_llu,drainage_class,soil_type,previous_crop,...,yield_kg,harvest_income_inr,video_seen_flag,video_topic_seen,organic_fert_qty_type3_missing,basal_npks_kg_missing,previous_crop_transplant_date_missing,previous_crop_tillage_method_missing,harvest_income_inr_missing,growth_duration
0,Siwan,Pachrukhi,11048,202308080747453,20.0,20.0,0.185,LowLand,blackSoil,Wheat,...,400.0,900.0,yes,climateChangeAndEffectOnAgri smartKisan reduci...,1,1,1,1,0,123.0
1,Siwan,Pachrukhi,12569,202308080801503,15.0,15.0,0.148,MediumLand,Clay,Wheat,...,300.0,600.0,yes,climateChangeAndEffectOnAgri smartKisan reduci...,1,1,1,1,0,123.0
2,Siwan,Pachrukhi,12485,202308080948183,15.0,15.0,0.148,Upland,SandySoils,Wheat,...,360.0,600.0,yes,climateChangeAndEffectOnAgri smartKisan reduci...,1,1,1,1,0,123.0
3,Siwan,Pachrukhi,9649,202308081818373,20.0,20.0,0.185,MediumLand,Clay,Wheat,...,420.0,900.0,yes,climateChangeAndEffectOnAgri smartKisan reduci...,1,1,1,1,0,153.0
4,Siwan,Pachrukhi,13111,202308091100453,15.0,15.0,0.185,MediumLand,Clay,Wheat,...,500.0,1200.0,yes,climateChangeAndEffectOnAgri smartKisan reduci...,1,1,1,1,0,123.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24064,West_Champaran,Majhauliya,31773,202311021259343,5.0,5.0,5.000,Upland,SandyLoam,Wheat,...,580.0,800.0,yes,climateChangeAndEffectOnAgri smartKisan reduci...,1,1,0,0,0,92.0
24065,West_Champaran,Majhauliya,31774,202311021259483,5.0,5.0,5.000,Upland,SandyLoam,Wheat,...,600.0,800.0,yes,climateChangeAndEffectOnAgri smartKisan reduci...,1,1,0,0,0,92.0
24066,West_Champaran,Majhauliya,31775,202311021300233,5.0,5.0,5.000,MediumLand,SandySoils,Wheat,...,580.0,800.0,yes,climateChangeAndEffectOnAgri smartKisan earlyP...,1,1,0,0,0,92.0
24067,West_Champaran,Majhauliya,31776,202311021300363,5.0,5.0,5.000,Upland,SandyLoam,Wheat,...,600.0,1000.0,yes,climateChangeAndEffectOnAgri smartKisan reduci...,1,1,0,0,0,92.0


The code standardizes categorical values by removing whitespace, converting to lowercase, and consolidating similar categories (like different spellings of the same soil type).

In [71]:
# 1. Standardize drainage_class categories
# Clean and standardize
df_cleaned['drainage_class'] = (
    df_cleaned['drainage_class'].str.strip()            # Remove whitespace
    .str.lower()                               # Convert to lowercase
)
# Verify standardization
print("\nStandardized drainage_class values:")
print(df_cleaned['drainage_class'].value_counts(dropna=False))


# 3. Additional categorical standardization (soil_type example)

df_cleaned['soil_type'] = (
    df_cleaned['soil_type'].str.strip()
    .str.lower()
    .str.replace(' ', '')
    .str.replace('soils', 'soil')
    .replace({
        'claysoil': 'claysoil',
        'sandysoil': 'sandysoil',
        'sandy loam': 'sandyloam',
        'alluvial': 'alluvialsoil',
        'black soil': 'blacksoil',
        'sandy': 'sandysoil',
        'clay':'claysoil'
    })
)

print("\nStandardized soil_type values:")
print(df_cleaned['soil_type'].value_counts(dropna=False))



Standardized drainage_class values:
drainage_class
mediumland    16216
upland         4712
lowland        3115
NaN              26
Name: count, dtype: int64

Standardized soil_type values:
soil_type
sandyloam       13100
alluvialsoil     2863
claysoil         2703
sandysoil        2648
blacksoil        2315
redsoil           293
other             121
NaN                26
Name: count, dtype: int64


In [72]:
df_cleaned['chemical_fertilizers_used_list'] = df_cleaned['chemical_fertilizers_used_list'].fillna("Not Specified")

### 1.Validation Check:

In [74]:
# Check for remaining inconsistencies
for col in ['drainage_class', 'soil_type']:
    print(f"\nUnique values in {col}:")
    print(sorted(df_cleaned[col].dropna().unique()))


Unique values in drainage_class:
['lowland', 'mediumland', 'upland']

Unique values in soil_type:
['alluvialsoil', 'blacksoil', 'claysoil', 'other', 'redsoil', 'sandyloam', 'sandysoil']


### Summary statistics of numeric fields (.describe()) to spot anomalies

In [76]:
df_cleaned.describe()

Unnamed: 0,record_id,farmer_id,total_cultivable_land_llu,land_under_rice_llu,largest_rice_plot_area_llu,previous_crop_transplant_date,seedling_age_days,rice_tillage_month,rice_tillage_depth_cm,irrigation_event_count,...,harvest_month,harvest_week,yield_kg,harvest_income_inr,organic_fert_qty_type3_missing,basal_npks_kg_missing,previous_crop_transplant_date_missing,previous_crop_tillage_method_missing,harvest_income_inr_missing,growth_duration
count,24069.0,24069.0,24069.0,24043.0,24069.0,16875,23957.0,24043,24043.0,22313.0,...,24069,24069.0,24069.0,23545.0,24069.0,24069.0,24069.0,24069.0,24069.0,24043.0
mean,17008.972703,202308782510463.28,277302300.0,16.505247,0.827076,2023-02-21 08:18:25.920000,23.137663,2023-06-23 19:35:16.458012672,11.750447,2.677094,...,2023-10-22 03:26:56.701981952,2.337405,502.748795,811.051331,0.963189,0.972994,0.298891,0.298891,0.021771,120.312066
min,2.0,202308031323053.0,0.04,0.02,-0.455,2022-07-01 00:00:00,14.0,2022-10-01 00:00:00,0.0,1.0,...,2023-10-01 00:00:00,1.0,-300.0,-300.0,0.0,0.0,0.0,0.0,0.0,30.0
25%,10228.0,202308211540503.0,8.0,6.0,0.148,2022-12-01 00:00:00,20.0,2023-06-01 00:00:00,6.0,2.0,...,2023-10-01 00:00:00,2.0,200.0,400.0,1.0,1.0,0.0,0.0,0.0,92.0
50%,17255.0,202308311843573.0,15.0,12.0,0.227,2023-03-01 00:00:00,22.0,2023-07-01 00:00:00,15.0,2.0,...,2023-11-01 00:00:00,2.0,360.0,600.0,1.0,1.0,0.0,0.0,0.0,123.0
75%,24350.0,202309170941333.0,24.0,20.0,0.455,2023-04-01 00:00:00,25.0,2023-07-01 00:00:00,15.0,3.0,...,2023-11-01 00:00:00,3.0,620.0,1000.0,1.0,1.0,1.0,1.0,0.0,123.0
max,31777.0,202311021306273.0,936007700000.0,420.0,100.0,2023-10-01 00:00:00,60.0,2023-10-01 00:00:00,15.0,15.0,...,2023-12-01 00:00:00,4.0,60000.0,40000.0,1.0,1.0,1.0,1.0,1.0,396.0
std,8611.623167,659601168.66795,14381090000.0,17.166966,3.140229,,5.207395,,4.379645,1.709318,...,,0.986213,930.335022,862.22519,0.188301,0.162103,0.457781,0.457781,0.145937,27.407806


In [77]:
# Generate and save descriptive statistics
desc_stats = df_cleaned.describe(include='all')
desc_stats

Unnamed: 0,district,block,record_id,farmer_id,total_cultivable_land_llu,land_under_rice_llu,largest_rice_plot_area_llu,drainage_class,soil_type,previous_crop,...,yield_kg,harvest_income_inr,video_seen_flag,video_topic_seen,organic_fert_qty_type3_missing,basal_npks_kg_missing,previous_crop_transplant_date_missing,previous_crop_tillage_method_missing,harvest_income_inr_missing,growth_duration
count,24069,24069,24069.0,24069.0,24069.0,24043.0,24069.0,24043,24043,24043,...,24069.0,23545.0,24069,24069,24069.0,24069.0,24069.0,24069.0,24069.0,24043.0
unique,11,33,,,,,,3,7,14,...,,,2,211,,,,,,
top,Siwan,Goriyakothi,,,,,,mediumland,sandyloam,Mungbean,...,,,yes,climateChangeAndEffectOnAgri smartKisan reduci...,,,,,,
freq,5035,2002,,,,,,16216,13100,8350,...,,,23584,11857,,,,,,
mean,,,17008.972703,202308782510463.28,277302300.0,16.505247,0.827076,,,,...,502.748795,811.051331,,,0.963189,0.972994,0.298891,0.298891,0.021771,120.312066
min,,,2.0,202308031323053.0,0.04,0.02,-0.455,,,,...,-300.0,-300.0,,,0.0,0.0,0.0,0.0,0.0,30.0
25%,,,10228.0,202308211540503.0,8.0,6.0,0.148,,,,...,200.0,400.0,,,1.0,1.0,0.0,0.0,0.0,92.0
50%,,,17255.0,202308311843573.0,15.0,12.0,0.227,,,,...,360.0,600.0,,,1.0,1.0,0.0,0.0,0.0,123.0
75%,,,24350.0,202309170941333.0,24.0,20.0,0.455,,,,...,620.0,1000.0,,,1.0,1.0,1.0,1.0,0.0,123.0
max,,,31777.0,202311021306273.0,936007700000.0,420.0,100.0,,,,...,60000.0,40000.0,,,1.0,1.0,1.0,1.0,1.0,396.0


#### 2.Data formatting 
- We try to handle type error wher yield was typed as negative and others

In [79]:
df_cleaned = df_cleaned.rename(columns={'weeding_type': 'weeding_times'})
df_cleaned['weeding_times'] = df_cleaned['weeding_times'].abs()
df_cleaned['yield_kg'] = df_cleaned['yield_kg'].abs()        
df_cleaned['harvest_income_inr'] = df_cleaned['harvest_income_inr'].abs()
df_cleaned['largest_rice_plot_area_llu'] =df_cleaned['largest_rice_plot_area_llu'].abs()

In [80]:
df_cleaned.isnull().sum()[df_cleaned.isnull().sum() > 0]


land_under_rice_llu                        26
drainage_class                             26
soil_type                                  26
previous_crop                              26
previous_crop_transplant_date            7194
seedling_age_days                         112
rice_tillage_month                         26
rice_tillage_method                        26
rice_tillage_depth_cm                      26
irrigation_event_count                   1756
chemical_fertilizer_application_count    1183
harvest_income_inr                        524
growth_duration                            26
dtype: int64

In [81]:
import pandas as pd
import numpy as np

# Columns with missing values (adjust based on your actual data)
null_columns = {
    'land_under_rice_llu': 'numeric',
    'drainage_class': 'categorical',
    'soil_type': 'categorical',
    'previous_crop': 'categorical',
    'previous_crop_transplant_date': 'datetime',
    'seedling_age_days': 'numeric',
    'rice_tillage_month': 'datetime', 
    'rice_tillage_method': 'categorical',
    'rice_tillage_depth_cm': 'numeric',
    'irrigation_event_count': 'numeric',
    'chemical_fertilizer_application_count': 'numeric',
    'harvest_income_inr': 'numeric',
    'growth_duration': 'numeric'
}

# Fill missing values accordingly
for col, dtype in null_columns.items():
    if dtype == 'numeric':
        # Assign back or use inplace=True
        df_cleaned[col] = df_cleaned[col].fillna(df_cleaned[col].median())
    elif dtype == 'categorical':
        df_cleaned[col] = df_cleaned[col].fillna(df_cleaned[col].mode()[0])
    elif dtype == 'datetime':
        df_cleaned[col] = df_cleaned[col].fillna(pd.NaT)

# Verify no more missing values
print(" Missing values after filling:")
print(df_cleaned[null_columns.keys()].isnull().sum())

 Missing values after filling:
land_under_rice_llu                         0
drainage_class                              0
soil_type                                   0
previous_crop                               0
previous_crop_transplant_date            7194
seedling_age_days                           0
rice_tillage_month                         26
rice_tillage_method                         0
rice_tillage_depth_cm                       0
irrigation_event_count                      0
chemical_fertilizer_application_count       0
harvest_income_inr                          0
growth_duration                             0
dtype: int64


In [96]:
median_date = df_cleaned['previous_crop_transplant_date'].median()  # Median date
df_cleaned['previous_crop_transplant_date'] = df_cleaned['previous_crop_transplant_date'].fillna(median_date)

In [105]:
median_month = df_cleaned['rice_tillage_month'].median()  # Median month
df_cleaned['rice_tillage_month'] = df_cleaned['rice_tillage_month'].fillna(median_month)

### Handling Outliers 

In [108]:
# Create masks to identify outliers
mask_total = df_cleaned['total_cultivable_land_llu'] >= 20
mask_rice = df_cleaned['land_under_rice_llu'] >= 20
mask_plot = df_cleaned['largest_rice_plot_area_llu'] <= 0  # negative or zero values are invalid

# Replace outliers with random integers between 0 and 20 (inclusive)
df_cleaned.loc[mask_total, 'total_cultivable_land_llu'] = np.random.randint(0, 21, size=mask_total.sum())
df_cleaned.loc[mask_rice, 'land_under_rice_llu'] = np.random.randint(0, 20, size=mask_rice.sum())
df_cleaned.loc[mask_plot, 'largest_rice_plot_area_llu'] = np.random.randint(0, 20, size=mask_plot.sum())

# Summary: number of replacements done
print(" Replacements done:")
print(f"total_cultivable_land_llu: {mask_total.sum()} rows replaced")
print(f"land_under_rice_llu: {mask_rice.sum()} rows replaced")
print(f"largest_rice_plot_area_llu: {mask_plot.sum()} rows replaced")

# Optional: check updated statistics
print("\n Updated Summary Stats:")
print(df_cleaned[['total_cultivable_land_llu', 'land_under_rice_llu', 'largest_rice_plot_area_llu']].describe())

 Replacements done:
total_cultivable_land_llu: 25 rows replaced
land_under_rice_llu: 0 rows replaced
largest_rice_plot_area_llu: 3 rows replaced

 Updated Summary Stats:
       total_cultivable_land_llu  land_under_rice_llu  \
count               24069.000000         24069.000000   
mean                    9.561394             8.919199   
std                     5.226135             5.067142   
min                     0.000000             0.000000   
25%                     5.000000             5.000000   
50%                    10.000000             9.000000   
75%                    15.000000            13.000000   
max                    20.000000            19.000000   

       largest_rice_plot_area_llu  
count                24069.000000  
mean                     1.170585  
std                      3.696591  
min                      0.031000  
25%                      0.156000  
50%                      0.267000  
75%                      0.455000  
max                    100.0

In [110]:
columns_to_clean = ['harvest_income_inr', 'yield_kg']

for col in columns_to_clean:
    if col in df_cleaned.columns:
        # Step 1: Identify outliers (values < 10)
        outliers_mask = df_cleaned[col] < 10
        
        # Step 2: Compute median (excluding outliers if needed)
        median_val = df_cleaned[col].median()
        
        # Step 3: Replace outliers with median
        df_cleaned.loc[outliers_mask, col] = median_val
        
        # Step 4: Ensure no nulls remain (fill any remaining NaNs with median)
        df_cleaned[col] = df_cleaned[col].fillna(median_val)
        
        print(f"Replaced {outliers_mask.sum()} outliers in '{col}' with median = {median_val}")
    else:
        print(f"Column '{col}' not found. Skipping.")

Replaced 0 outliers in 'harvest_income_inr' with median = 600.0
Replaced 0 outliers in 'yield_kg' with median = 360.0


## Rounding Numerical Columns for Consistency

To improve data readability and maintain consistent formatting, we round selected numeric columns to **1 decimal place**.

 **Why this matters:**  
- Reduces unnecessary decimal clutter  
- Standardizes the format for reporting and visualization  
- Keeps data presentation clean and professional



In [113]:
# Round the specified columns to 1 decimal place
df_cleaned['basal_urea_kg'] = df_cleaned['basal_urea_kg'].round(1)
df_cleaned['basal_dap_kg'] = df_cleaned['basal_dap_kg'].round(1)
df_cleaned['total_cultivable_land_llu']=df_cleaned['total_cultivable_land_llu'].round(1)
df_cleaned['land_under_rice_llu']=df_cleaned['land_under_rice_llu'].round(1)

In [115]:
df_cleaned = df_cleaned.rename(columns={'weeding_type': 'weeding_times'})
df_cleaned['weeding_times'] = df_cleaned['weeding_times'].abs()
df_cleaned['yield_kg'] = df_cleaned['yield_kg'].abs()        
df_cleaned['harvest_income_inr'] = df_cleaned['harvest_income_inr'].abs()
df_cleaned['largest_rice_plot_area_llu'] =df_cleaned['largest_rice_plot_area_llu'].abs()

In [117]:
# Step 1: Remove outliers (values <1 or >5)
df_cleaned.loc[~df_cleaned['weeding_times'].between(1, 5), 'weeding_times'] = np.nan

# Step 2: Count how many nulls we created
null_count = df_cleaned['weeding_times'].isna().sum()
print(f"Created {null_count} null values by removing outliers")

# Step 3: Fill nulls with random integers between 2-5
df_cleaned['weeding_times'] = df_cleaned['weeding_times'].fillna(
    pd.Series(np.random.randint(2, 6, size=len(df)))
)

# Verification
print("\nValue counts after cleaning:")
print(df_cleaned['weeding_times'].value_counts().sort_index())

Created 0 null values by removing outliers

Value counts after cleaning:
weeding_times
1.0     9221
2.0    13200
3.0     1311
4.0      190
5.0      147
Name: count, dtype: int64


In [135]:
df_cleaned['fertilizer_used_during_land_preparation']= df_cleaned['fertilizer_used_during_land_preparation'].replace(0, "Not Specified")
df_cleaned['previous_crop_tillage_method'] = df_cleaned['previous_crop_tillage_method'].fillna("Not Specified")

In [127]:
df_cleaned.columns

Index(['district', 'block', 'record_id', 'farmer_id',
       'total_cultivable_land_llu', 'land_under_rice_llu',
       'largest_rice_plot_area_llu', 'drainage_class', 'soil_type',
       'previous_crop', 'previous_crop_tillage_method',
       'previous_crop_transplant_date', 'seedling_age_days',
       'rice_tillage_month', 'rice_tillage_method', 'rice_tillage_depth_cm',
       'irrigation_applied_flag', 'irrigation_event_count',
       'organic_fertilizer_used_flag', 'organic_ganaura_used',
       'organic_fym_used', 'organic_vermicompost_used',
       'organic_poultry_manure_used', 'organic_fert_qty_type1',
       'organic_fert_qty_type2', 'organic_fert_qty_type3',
       'organic_fert_qty_type4', 'chemical_fertilizer_used_flag',
       'chemical_fertilizer_application_count',
       'fertilizer_used_during_land_preparation',
       'chemical_fertilizers_used_list', 'chem_fert_urea_used_flag',
       'chem_fert_dap_used_flag', 'chem_fert_npks_used_flag',
       'chem_fert_mop_used_f

In [125]:
columns_to_drop = [
    'organic_fert_qty_type3_missing',
    'basal_npks_kg_missing',
    'previous_crop_transplant_date_missing',
    'previous_crop_tillage_method_missing',
    'harvest_income_inr_missing'
]

df_cleaned = df_cleaned.drop(columns=columns_to_drop)

In [129]:
df_cleaned.isnull().sum()[df_cleaned.isnull().sum() > 0]

Series([], dtype: int64)

In [137]:
df_cleaned['fertilizer_used_during_land_preparation'].unique()

array(['yes', 'no', 'Not Specified'], dtype=object)

In [158]:
df_cleaned['total_cultivable_land_llu'] = (df_cleaned['total_cultivable_land_llu'].replace(0, 5))
df_cleaned['land_under_rice_llu'] = (df_cleaned['total_cultivable_land_llu'].replace(0,5))

In [160]:
df_cleaned['chemical_fertilizer_used_flag'] = (df_cleaned['chemical_fertilizer_used_flag'].replace(0,"not specified"))

##### Save the DataFrame to CSV

In [163]:
df_cleaned

Unnamed: 0,district,block,record_id,farmer_id,total_cultivable_land_llu,land_under_rice_llu,largest_rice_plot_area_llu,drainage_class,soil_type,previous_crop,...,weeding_times,weeding_method,harvest_month,harvest_week,harvest_method,yield_kg,harvest_income_inr,video_seen_flag,video_topic_seen,growth_duration
0,Siwan,Pachrukhi,11048,202308080747453,5.0,5.0,0.185,lowland,blacksoil,Wheat,...,2.0,byHand,2023-11-01,2,byHand,400.0,900.0,yes,climateChangeAndEffectOnAgri smartKisan reduci...,123.0
1,Siwan,Pachrukhi,12569,202308080801503,15.0,15.0,0.148,mediumland,claysoil,Wheat,...,2.0,byHand,2023-11-01,2,byHand,300.0,600.0,yes,climateChangeAndEffectOnAgri smartKisan reduci...,123.0
2,Siwan,Pachrukhi,12485,202308080948183,15.0,15.0,0.148,upland,sandysoil,Wheat,...,2.0,byHand,2023-11-01,1,byHand,360.0,600.0,yes,climateChangeAndEffectOnAgri smartKisan reduci...,123.0
3,Siwan,Pachrukhi,9649,202308081818373,16.0,16.0,0.185,mediumland,claysoil,Wheat,...,2.0,byHand,2023-11-01,3,byHand,420.0,900.0,yes,climateChangeAndEffectOnAgri smartKisan reduci...,153.0
4,Siwan,Pachrukhi,13111,202308091100453,15.0,15.0,0.185,mediumland,claysoil,Wheat,...,2.0,byHand,2023-11-01,3,byHand,500.0,1200.0,yes,climateChangeAndEffectOnAgri smartKisan reduci...,123.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24064,West_Champaran,Majhauliya,31773,202311021259343,5.0,5.0,5.000,upland,sandyloam,Wheat,...,2.0,byHand,2023-10-01,2,byHand,580.0,800.0,yes,climateChangeAndEffectOnAgri smartKisan reduci...,92.0
24065,West_Champaran,Majhauliya,31774,202311021259483,5.0,5.0,5.000,upland,sandyloam,Wheat,...,2.0,byHand,2023-10-01,2,byHand,600.0,800.0,yes,climateChangeAndEffectOnAgri smartKisan reduci...,92.0
24066,West_Champaran,Majhauliya,31775,202311021300233,5.0,5.0,5.000,mediumland,sandysoil,Wheat,...,1.0,byHand,2023-10-01,2,byHand,580.0,800.0,yes,climateChangeAndEffectOnAgri smartKisan earlyP...,92.0
24067,West_Champaran,Majhauliya,31776,202311021300363,5.0,5.0,5.000,upland,sandyloam,Wheat,...,2.0,byHand,2023-10-01,2,byHand,600.0,1000.0,yes,climateChangeAndEffectOnAgri smartKisan reduci...,92.0


In [165]:
# Specify the target directory (change this to your desired path)
target_directory = r"C:\Users\Ashulah\Downloads\data-rice-cultivation\data"  # Windows example
# Or for Mac/Linux: target_directory = "/path/to/your/target/folder/"

# Create the directory if it doesn't exist
os.makedirs(target_directory, exist_ok=True)

# Specify the full file path
output_file_path = os.path.join(target_directory, "Cleaned_data.csv")

# Save the DataFrame to CSV in the specified folder
df_cleaned.to_csv(output_file_path, index=False)

print(f"File successfully saved to: {output_file_path}")


File successfully saved to: C:\Users\Ashulah\Downloads\data-rice-cultivation\data\Cleaned_data.csv


### Summary
This cleaned dataset will:

Include all quantitative fertilizer variables

Maintain fertilizer application rounds (basal)

Retain management practices and yield variables

Be ready for analysis of correlation between practices and yield