# <span style="color:#336699; font-size:50px; font-weight:bold;">Project - Crop Production Analysis in India</span>

<span style="font-size:18px; color:#999999; font-style:italic;">Akash Patil <br> aakashgolu1008@gmail</span>



## Problem Statement:


The Agriculture business domain, as a vital part of the overall supply chain, is
expected to highly evolve in the upcoming years via the developments, which are
taking place on the side of the Future Internet. This paper presents a novel
Business-to-Business collaboration platform from the agri-food sector perspective,
which aims to facilitate the collaboration of numerous stakeholders belonging to
associated business domains, in an effective and flexible manner.
This dataset provides a large amount of information on crop production in India
spanning several years. Based on this information, the ultimate goal is to
predict crop production and to find important insights highlighting the key indicators and
metrics that influence crop production.

Build the views and dashboards first, then assemble them into a story.

## Dataset:
Dataset is available in the given link. You can download it at your convenience.

[Download Data](https://drive.google.com/file/d/1b3E1vpDSYpHe8YlNs3jkt30Lx6acf0Uo/view?usp=share_link)

**Project Workflow:**

1. **Data Cleaning:** Remove unnecessary columns and handle missing values.

2. **Exploratory Data Analysis (EDA):** Analyze production trends, crop distribution, and seasonal variations. Visualize data to identify patterns and correlations.

3. **Model Selection and Training:** Choose a regression model, split data, scale features, and train the model.

4. **Model Evaluation:** Assess model performance using Root Mean Squared Error (RMSE) and R-squared (R2).

6. **Conclusion and Documentation:** Summarize findings, draw conclusions, and document the workflow for future reference.
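The two metrics named in step 4 can be written out directly. The following is a minimal pure-Python sketch with made-up numbers; the function names `rmse` and `r_squared` are hypothetical and are not taken from the project code, where a library implementation would normally be used instead.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up values purely for illustration
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [2.0, 4.0, 6.0, 10.0]

print(rmse(actual, predicted))
print(r_squared(actual, predicted))
```

A lower RMSE means smaller errors in the units of the target, while R2 closer to 1 means the model explains more of the variance in the target.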

## Data Cleaning

### 2.1 Import Data and Required Packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
pd.options.display.float_format = '{:,.2f}'.format


#### Import the CSV Data as Pandas DataFrame

In [2]:
# read the dataset
df=pd.read_csv("data/Crop Production data.csv")
df.head()

Unnamed: 0,State_Name,District_Name,Crop_Year,Season,Crop,Area,Production
0,Andaman and Nicobar Islands,NICOBARS,2000,Kharif,Arecanut,1254.0,2000.0
1,Andaman and Nicobar Islands,NICOBARS,2000,Kharif,Other Kharif pulses,2.0,1.0
2,Andaman and Nicobar Islands,NICOBARS,2000,Kharif,Rice,102.0,321.0
3,Andaman and Nicobar Islands,NICOBARS,2000,Whole Year,Banana,176.0,641.0
4,Andaman and Nicobar Islands,NICOBARS,2000,Whole Year,Cashewnut,720.0,165.0


### 2.2 Dataset information

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 246091 entries, 0 to 246090
Data columns (total 7 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   State_Name     246091 non-null  object 
 1   District_Name  246091 non-null  object 
 2   Crop_Year      246091 non-null  int64  
 3   Season         246091 non-null  object 
 4   Crop           246091 non-null  object 
 5   Area           246091 non-null  float64
 6   Production     242361 non-null  float64
dtypes: float64(2), int64(1), object(4)
memory usage: 13.1+ MB


## Observations-
- The dataset has 7 columns and 246,091 rows.
- There are 2 float columns, 4 object columns, and 1 int column.
- The Production column has some missing values (242,361 non-null out of 246,091).

### Check statistics of data set

In [4]:
df.describe(include="all")

Unnamed: 0,State_Name,District_Name,Crop_Year,Season,Crop,Area,Production
count,246091,246091,246091.0,246091,246091,246091.0,242361.0
unique,33,646,,6,124,,
top,Uttar Pradesh,BIJAPUR,,Kharif,Rice,,
freq,33306,945,,95951,15104,,
mean,,,2005.64,,,12002.82,582503.44
std,,,4.95,,,50523.4,17065813.17
min,,,1997.0,,,0.04,0.0
25%,,,2002.0,,,80.0,88.0
50%,,,2006.0,,,582.0,729.0
75%,,,2010.0,,,4392.0,7023.0


## Observations-
- For int and float columns
    - In the "Crop_Year" column, **years from `1997` to `2015` are present.**
    - In the "Area" column, **the minimum area is `0.04` and the maximum area is `85,80,100` (8,580,100)**, in hectares.
    - In the "Production" column, **the minimum production is `0` and the maximum production is `125,08,00,000` (1,250,800,000)**, in MT.
- For object columns
    - In the "State_Name" column, **there are `33` distinct states, with *Uttar Pradesh* being the most frequent, occurring `33,306` times.**
    - In the "District_Name" column, **there are `646` distinct districts, with *BIJAPUR* being the most frequent, occurring `945` times.**
    - In the "Season" column, **there are `6` seasons, with *Kharif* being the most frequent, occurring `95,951` times.**
    - In the "Crop" column, **there are `124` unique crops, with *Rice* being the most frequent, occurring `15,104` times.**

#### Data Checks to perform
- Remove spaces from the values of Crop, State_Name, and Season and replace them with underscores (_)
- Create new features: Zone from State_Name and Crop_Category from Crop
- Check for duplicates
- Check for missing values
- Check the number of unique values of each column
- Check for outliers


##### Replace spaces in the feature values with underscores to make later operations easier.

In [5]:
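# Strip whitespace, replace spaces with underscores, then drop parentheses and slashes.
# A raw string avoids invalid-escape warnings for the regex pattern.
df["Crop"]=df["Crop"].str.strip().str.replace(" ","_").str.replace(r"[()/]", "", regex=True)
df["State_Name"]=df["State_Name"].str.strip().str.replace(" ","_").str.replace(r"[()/]", "", regex=True)
df["Season"]=df["Season"].str.strip().str.replace(" ","_").str.replace(r"[()/]", "", regex=True)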
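To make the effect of this cleaning step concrete, here is a small sketch that applies the same three string operations to a few sample labels. The helper name `clean_label` and the raw input strings are assumptions chosen for illustration; they are not read from the dataset.

```python
import pandas as pd

def clean_label(series):
    """Strip whitespace, turn spaces into underscores, drop ( ) and /."""
    return (
        series.str.strip()
        .str.replace(" ", "_", regex=False)
        .str.replace(r"[()/]", "", regex=True)
    )

# Hypothetical raw labels, chosen to resemble the cleaned names seen later
raw = pd.Series(["Arhar/Tur", "Moong(Green Gram)", " Whole Year ", "Rice"])
cleaned = clean_label(raw)
print(cleaned.tolist())
```

Wrapping the chain in one function keeps the three columns consistent and avoids repeating the pattern.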

##### Create the new features Zone and Crop_Category from State_Name and Crop.

In [6]:
north_india = ['Jammu_and_Kashmir', 'Punjab', 'Himachal_Pradesh', 'Haryana', 'Uttarakhand', 'Uttar_Pradesh', 'Chandigarh']
east_india = ['Bihar', 'Odisha', 'Jharkhand', 'West_Bengal']
south_india = ['Andhra_Pradesh', 'Karnataka', 'Kerala' ,'Tamil_Nadu', 'Telangana']
west_india = ['Rajasthan' , 'Gujarat', 'Goa','Maharashtra']
central_india = ['Madhya_Pradesh', 'Chhattisgarh']
north_east_india = ['Assam', 'Sikkim', 'Nagaland', 'Meghalaya', 'Manipur', 'Mizoram', 'Tripura', 'Arunachal_Pradesh']
ut_india = ['Andaman_and_Nicobar_Islands', 'Dadra_and_Nagar_Haveli', 'Puducherry']

In [7]:
def zone_assign(state):
    if state in north_india:
        return "north_india"
    elif state in east_india:
        return "east_india"
    elif state in south_india:
        return "south_india"
    elif state in west_india:
        return "west_india"
    elif state in central_india:
        return "central_india"
    elif state in north_east_india:
        return "north_east_india"
    elif state in ut_india:
        return "ut_india"
    


In [8]:
# Pass the function directly; wrapping it in a lambda is unnecessary
df['Zone']=df["State_Name"].apply(zone_assign)
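As an alternative to the `if`/`elif` chain, the zone lists can be inverted into a single lookup dictionary and applied with `Series.map`. The sketch below uses shortened zone lists and a made-up unmatched value; the names `zones`, `state_to_zone`, and `unmatched` are hypothetical.

```python
import pandas as pd

# Shortened zone lists for illustration only; the full lists are defined above
zones = {
    "north_india": ["Punjab", "Haryana"],
    "south_india": ["Kerala", "Tamil_Nadu"],
}

# Invert to a flat state -> zone lookup
state_to_zone = {state: zone for zone, states in zones.items() for state in states}

# "Atlantis" is a made-up value that matches no zone
states = pd.Series(["Punjab", "Kerala", "Atlantis"])
zone = states.map(state_to_zone)

# Unmatched states become NaN, so they can be listed explicitly
unmatched = states[zone.isna()].tolist()
print(zone.tolist(), unmatched)
```

One advantage of this form is that a state missing from every list shows up as a missing value that is easy to detect, whereas `zone_assign` silently returns `None`.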

In [9]:
df.head()

Unnamed: 0,State_Name,District_Name,Crop_Year,Season,Crop,Area,Production,Zone
0,Andaman_and_Nicobar_Islands,NICOBARS,2000,Kharif,Arecanut,1254.0,2000.0,ut_india
1,Andaman_and_Nicobar_Islands,NICOBARS,2000,Kharif,Other_Kharif_pulses,2.0,1.0,ut_india
2,Andaman_and_Nicobar_Islands,NICOBARS,2000,Kharif,Rice,102.0,321.0,ut_india
3,Andaman_and_Nicobar_Islands,NICOBARS,2000,Whole_Year,Banana,176.0,641.0,ut_india
4,Andaman_and_Nicobar_Islands,NICOBARS,2000,Whole_Year,Cashewnut,720.0,165.0,ut_india


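Next, create a new feature from an existing one: Crop_Category from Crop.

Candidate categories: Cereals, Pulses, Fruits, Beans, Vegetables, Spices, Fibers, Nuts, Natural Polymer, Coffee, Tea, Total Foodgrain, Oilseeds, Paddy, Commercial, Sugarcane, Forage Plants, and Others.

The `categorize_crop` function below implements Cereals, Pulses, Oilseeds, Nuts, Fruits, Vegetables, Spices, Fiber, Sugarcane, Tobacco, Tea, Coffee, Rubber, and Others.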

In [10]:
def categorize_crop(crop):
    cereals = ['Rice', 'Maize', 'Bajra', 'Jowar', 'Ragi', 'Wheat', 'Barley', 'Sorghum']
    pulses = ['ArharTur', 'MoongGreen_Gram', 'Urad', 'Gram', 'Masoor', 'Tur', 'Lentil']
    oilseeds = ['Groundnut', 'Sesamum', 'Sunflower', 'Rapeseed_&Mustard', 'Linseed', 'Castor_seed', 'Soyabean', 'Niger_seed', 'Safflower', 'Mustard']
    # 'Groundnut' is also listed in oilseeds, which is checked first, so it is labelled 'Oilseeds'
    nuts = ['Groundnut', 'Cashewnut', 'Coconut']
    fruits = ['Banana', 'Mango', 'Grapes', 'Citrus_Fruit', 'Orange', 'Papaya', 'Pome_Fruit', 'Pome_Granet', 'Lemon', 'Apple', 'Peach', 'Pear', 'Plums', 'Litchi', 'Ber', 'Other_Fresh_Fruits']
    vegetables = ['Potato', 'Onion', 'Brinjal', 'Tomato', 'Bhindi', 'Cucumber', 'Cabbage', 'Cauliflower', 'Peas__vegetable', 'Peas_&_beans_Pulses', 'Beans_&_MutterVegetable', 'Other_Vegetables']
    spices = ['Black_pepper', 'Dry_chillies', 'Turmeric', 'Coriander', 'Garlic', 'Ginger', 'Cardamom', 'Cond-spcs_other']
    fiber = ['Cottonlint', 'Jute', 'Mesta']
    Sugarcane=['Sugarcane']
    Tobacco=['Tobacco']
    Tea=['Tea']
    Coffee=['Coffee']
    Rubber=['Rubber']
    
    if crop in cereals:
        return 'Cereals'
    elif crop in pulses:
        return 'Pulses'
    elif crop in oilseeds:
        return 'Oilseeds'
    elif crop in nuts:
        return 'Nuts'
    elif crop in fruits:
        return 'Fruits'
    elif crop in vegetables:
        return 'Vegetables'
    elif crop in spices:
        return 'Spices'
    elif crop in fiber:
        return 'Fiber'
    elif crop in Sugarcane:
        return 'Sugarcane'
    elif crop in Tobacco:
        return 'Tobacco'
    elif crop in Tea:
        return 'Tea'
    elif crop in Coffee:
        return 'Coffee'
    elif crop in Rubber:
        return 'Rubber'
    else:
        return 'Others'


In [11]:
df['Crop_Category'] = df['Crop'].apply(categorize_crop)
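Because `categorize_crop` returns on the first matching list, a crop that appears in two lists always receives the earlier category. A quick way to spot such overlaps is sketched below; the `categories` dictionary holds shortened copies of two lists from the function, and the variable names are hypothetical.

```python
from collections import defaultdict

# Shortened copies of two lists from categorize_crop, for illustration
categories = {
    "Oilseeds": ["Groundnut", "Sesamum", "Sunflower"],
    "Nuts": ["Groundnut", "Cashewnut", "Coconut"],
}

# Record every category each crop is listed under
membership = defaultdict(list)
for category, crops in categories.items():
    for crop in crops:
        membership[crop].append(category)

# Keep only crops that belong to more than one category
overlaps = {crop: cats for crop, cats in membership.items() if len(cats) > 1}
print(overlaps)
```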

In [14]:
df.head()

Unnamed: 0,State_Name,District_Name,Crop_Year,Season,Crop,Area,Production,Zone,Crop_Category
0,Andaman_and_Nicobar_Islands,NICOBARS,2000,Kharif,Arecanut,1254.0,2000.0,ut_india,Others
1,Andaman_and_Nicobar_Islands,NICOBARS,2000,Kharif,Other_Kharif_pulses,2.0,1.0,ut_india,Others
2,Andaman_and_Nicobar_Islands,NICOBARS,2000,Kharif,Rice,102.0,321.0,ut_india,Cereals
3,Andaman_and_Nicobar_Islands,NICOBARS,2000,Whole_Year,Banana,176.0,641.0,ut_india,Fruits
4,Andaman_and_Nicobar_Islands,NICOBARS,2000,Whole_Year,Cashewnut,720.0,165.0,ut_india,Nuts


### Check for duplicate values

In [15]:
# duplicated() must be called to get the boolean mask of repeated rows
df[df.duplicated()]

Unnamed: 0,State_Name,District_Name,Crop_Year,Season,Crop,Area,Production,Zone,Crop_Category


There are no duplicate rows.

### Check for missing values

In [16]:
# find the missing values 
df.isnull().sum()/len(df)*100

State_Name      0.00
District_Name   0.00
Crop_Year       0.00
Season          0.00
Crop            0.00
Area            0.00
Production      1.52
Zone            0.00
Crop_Category   0.00
dtype: float64

## Observations 
- In the Production column, 1.52% of the values are missing.
- Because so few values are missing, we can drop the rows that contain them.
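The check-then-drop pattern described above can be sketched on a toy frame. The data here is made up, and restricting `dropna` to the `Production` column is an assumption about intent; it gives the same result as a plain `dropna()` when only that column has gaps.

```python
import pandas as pd

# Toy frame standing in for the crop data (values are made up)
toy = pd.DataFrame({
    "Crop": ["Rice", "Wheat", "Maize", "Rice"],
    "Area": [10.0, 20.0, 30.0, 40.0],
    "Production": [5.0, None, 9.0, 12.0],
})

# Share of missing values per column, as a percentage
missing_pct = toy.isna().mean() * 100

# Drop only the rows where Production is missing
cleaned = toy.dropna(subset=["Production"]).reset_index(drop=True)
print(missing_pct["Production"], len(cleaned))
```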

In [17]:
Crop_df=df.dropna().reset_index(drop=True)

Here we create a new feature, Production per Area.

In [18]:
# Create the new column: production per unit of area
Crop_df['Production/Area']= Crop_df['Production']/Crop_df['Area']
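In this dataset the minimum `Area` reported earlier is 0.04, so the division never hits zero. If the same step were reused on data where zero areas can occur, a defensive variant could mask them first. The sketch below uses made-up values and is only one possible approach.

```python
import pandas as pd

# Made-up values, including one zero area
toy = pd.DataFrame({"Area": [2.0, 0.0, 4.0], "Production": [1.0, 3.0, 10.0]})

# Mask non-positive areas to NaN before dividing so the ratio is NaN, not inf
safe_area = toy["Area"].where(toy["Area"] > 0)
toy["Production/Area"] = toy["Production"] / safe_area
print(toy)
```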


In [19]:
# Keep a copy of the data with missing values dropped as data_df
data_df=Crop_df.copy()

In [20]:
Crop_df.head()

Unnamed: 0,State_Name,District_Name,Crop_Year,Season,Crop,Area,Production,Zone,Crop_Category,Production/Area
0,Andaman_and_Nicobar_Islands,NICOBARS,2000,Kharif,Arecanut,1254.0,2000.0,ut_india,Others,1.59
1,Andaman_and_Nicobar_Islands,NICOBARS,2000,Kharif,Other_Kharif_pulses,2.0,1.0,ut_india,Others,0.5
2,Andaman_and_Nicobar_Islands,NICOBARS,2000,Kharif,Rice,102.0,321.0,ut_india,Cereals,3.15
3,Andaman_and_Nicobar_Islands,NICOBARS,2000,Whole_Year,Banana,176.0,641.0,ut_india,Fruits,3.64
4,Andaman_and_Nicobar_Islands,NICOBARS,2000,Whole_Year,Cashewnut,720.0,165.0,ut_india,Nuts,0.23


###  Exploring Data
#### Check the unique values of each column

In [21]:
def col_analysis(data,x):
    print(f"Number of unique values in {x} column is - ",data[x].nunique())
    print(f"\nUnique values for {x} - \n",data[x].unique())
    print(f"\nValue Counts in {x} -\n", data[x].value_counts())


In [22]:
col_analysis(Crop_df,"State_Name")

Number of unique values in State_Name column is -  33

Unique values for State_Name - 
 ['Andaman_and_Nicobar_Islands' 'Andhra_Pradesh' 'Arunachal_Pradesh'
 'Assam' 'Bihar' 'Chandigarh' 'Chhattisgarh' 'Dadra_and_Nagar_Haveli'
 'Goa' 'Gujarat' 'Haryana' 'Himachal_Pradesh' 'Jammu_and_Kashmir'
 'Jharkhand' 'Karnataka' 'Kerala' 'Madhya_Pradesh' 'Maharashtra' 'Manipur'
 'Meghalaya' 'Mizoram' 'Nagaland' 'Odisha' 'Puducherry' 'Punjab'
 'Rajasthan' 'Sikkim' 'Tamil_Nadu' 'Telangana' 'Tripura' 'Uttar_Pradesh'
 'Uttarakhand' 'West_Bengal']

Value Counts in State_Name -
 State_Name
Uttar_Pradesh                  33189
Madhya_Pradesh                 22604
Karnataka                      21079
Bihar                          18874
Assam                          14622
Odisha                         13524
Tamil_Nadu                     13266
Maharashtra                    12496
Rajasthan                      12066
Chhattisgarh                   10368
West_Bengal                     9597
Andhra_Pradesh  

In [23]:
col_analysis(Crop_df,"District_Name")

Number of unique values in District_Name column is -  646

Unique values for District_Name - 
 ['NICOBARS' 'NORTH AND MIDDLE ANDAMAN' 'SOUTH ANDAMANS' 'ANANTAPUR'
 'CHITTOOR' 'EAST GODAVARI' 'GUNTUR' 'KADAPA' 'KRISHNA' 'KURNOOL'
 'PRAKASAM' 'SPSR NELLORE' 'SRIKAKULAM' 'VISAKHAPATANAM' 'VIZIANAGARAM'
 'WEST GODAVARI' 'ANJAW' 'CHANGLANG' 'DIBANG VALLEY' 'EAST KAMENG'
 'EAST SIANG' 'KURUNG KUMEY' 'LOHIT' 'LONGDING' 'LOWER DIBANG VALLEY'
 'LOWER SUBANSIRI' 'NAMSAI' 'PAPUM PARE' 'TAWANG' 'TIRAP' 'UPPER SIANG'
 'UPPER SUBANSIRI' 'WEST KAMENG' 'WEST SIANG' 'BAKSA' 'BARPETA'
 'BONGAIGAON' 'CACHAR' 'CHIRANG' 'DARRANG' 'DHEMAJI' 'DHUBRI' 'DIBRUGARH'
 'DIMA HASAO' 'GOALPARA' 'GOLAGHAT' 'HAILAKANDI' 'JORHAT' 'KAMRUP'
 'KAMRUP METRO' 'KARBI ANGLONG' 'KARIMGANJ' 'KOKRAJHAR' 'LAKHIMPUR'
 'MARIGAON' 'NAGAON' 'NALBARI' 'SIVASAGAR' 'SONITPUR' 'TINSUKIA'
 'UDALGURI' 'ARARIA' 'ARWAL' 'AURANGABAD' 'BANKA' 'BEGUSARAI' 'BHAGALPUR'
 'BHOJPUR' 'BUXAR' 'DARBHANGA' 'GAYA' 'GOPALGANJ' 'JAMUI' 'JEHANABAD'
 'KAIMUR

In [24]:
col_analysis(Crop_df,"Season")

Number of unique values in Season column is -  6

Unique values for Season - 
 ['Kharif' 'Whole_Year' 'Autumn' 'Rabi' 'Summer' 'Winter']

Value Counts in Season -
 Season
Kharif        94283
Rabi          66160
Whole_Year    56127
Summer        14811
Winter         6050
Autumn         4930
Name: count, dtype: int64


In [25]:
col_analysis(Crop_df,"Crop")

Number of unique values in Crop column is -  124

Unique values for Crop - 
 ['Arecanut' 'Other_Kharif_pulses' 'Rice' 'Banana' 'Cashewnut' 'Coconut'
 'Dry_ginger' 'Sugarcane' 'Sweet_potato' 'Tapioca' 'Black_pepper'
 'Dry_chillies' 'other_oilseeds' 'Turmeric' 'Maize' 'MoongGreen_Gram'
 'Urad' 'ArharTur' 'Groundnut' 'Sunflower' 'Bajra' 'Castor_seed'
 'Cottonlint' 'Horse-gram' 'Jowar' 'Korra' 'Ragi' 'Tobacco' 'Gram' 'Wheat'
 'Masoor' 'Sesamum' 'Linseed' 'Safflower' 'Onion' 'other_misc._pulses'
 'Samai' 'Small_millets' 'Coriander' 'Potato' 'Other__Rabi_pulses'
 'Soyabean' 'Beans_&_MutterVegetable' 'Bhindi' 'Brinjal' 'Citrus_Fruit'
 'Cucumber' 'Grapes' 'Mango' 'Orange' 'other_fibres' 'Other_Fresh_Fruits'
 'Other_Vegetables' 'Papaya' 'Pome_Fruit' 'Tomato' 'Mesta' 'CowpeaLobia'
 'Lemon' 'Pome_Granet' 'Sapota' 'Cabbage' 'Rapeseed_&Mustard'
 'Peas__vegetable' 'Niger_seed' 'Bottle_Gourd' 'Varagu' 'Garlic' 'Ginger'
 'Oilseeds_total' 'Pulses_total' 'Jute' 'Peas_&_beans_Pulses' 'Blackgram'
 'Paddy'

In [26]:
col_analysis(Crop_df,"Crop_Year")

Number of unique values in Crop_Year column is -  19

Unique values for Crop_Year - 
 [2000 2001 2002 2003 2004 2005 2006 2010 1997 1998 1999 2007 2008 2009
 2011 2012 2013 2014 2015]

Value Counts in Crop_Year -
 Crop_Year
2003    17139
2002    16536
2007    14269
2008    14230
2006    13976
2004    13858
2010    13793
2011    13791
2009    13767
2000    13553
2005    13519
2013    13475
2001    13293
2012    13184
1999    12441
1998    11262
2014    10815
1997     8899
2015      561
Name: count, dtype: int64


In [27]:
col_analysis(Crop_df,"Production")

Number of unique values in Production column is -  51627

Unique values for Production - 
 [2.00000e+03 1.00000e+00 3.21000e+02 ... 7.29553e+05 7.30136e+05
 5.97899e+05]

Value Counts in Production -
 Production
1.00              4028
0.00              3523
100.00            3521
2.00              2964
3.00              2311
                  ... 
212,000,000.00       1
1.07                 1
229,341.00           1
18,706.00            1
597,899.00           1
Name: count, Length: 51627, dtype: int64


In [28]:
col_analysis(Crop_df,"Area")

Number of unique values in Area column is -  38391

Unique values for Area - 
 [1.25400e+03 2.00000e+00 1.02000e+02 ... 3.02274e+05 1.14930e+04
 2.79151e+05]

Value Counts in Area -
 Area
1.00          3573
2.00          3140
100.00        2621
3.00          2478
4.00          2182
              ... 
25,569.00        1
19,349.00        1
90,302.00        1
39,698.00        1
279,151.00       1
Name: count, Length: 38391, dtype: int64


In [29]:
print("min value",Crop_df['Production/Area'].min())
print("max value",Crop_df['Production/Area'].max())
col_analysis(Crop_df,"Production/Area")

min value 0.0
max value 88000.0
Number of unique values in Production/Area column is -  144943

Unique values for Production/Area - 
 [ 1.59489633  0.5         3.14705882 ...  0.738437   50.15432099
  2.14184796]

Value Counts in Production/Area -
 Production/Area
1.00    7096
0.50    4051
0.00    3523
0.33    1887
2.00    1675
        ... 
0.21       1
1.38       1
1.65       1
1.03       1
2.14       1
Name: count, Length: 144943, dtype: int64


In [30]:
# List the crops whose total Production is zero.
A1=pd.DataFrame(df.groupby("Crop")['Production'].sum()==0).reset_index()
A1[A1["Production"]==True]['Crop'].values

array(['Apple', 'Ash_Gourd', 'Beet_Root', 'Ber', 'Cucumber', 'Lab-Lab',
       'Litchi', 'Other_Citrus_Fruit', 'Other_Dry_Fruit', 'Peach', 'Pear',
       'Peas__vegetable', 'Plums', 'Pump_Kin', 'Ribed_Guard',
       'Snak_Guard', 'Water_Melon', 'Yam', 'other_fibres'], dtype=object)
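One detail worth keeping in mind when reading this list: the cell above groups the original `df`, and a pandas sum skips missing values by default, so a crop whose `Production` values are all missing also sums to zero. The sketch below shows the difference on made-up data using the `min_count` parameter; the crop labels and variable names are hypothetical.

```python
import pandas as pd

# Made-up data: crop "A" has only missing values, "B" a recorded zero, "C" real output
toy = pd.DataFrame({
    "Crop": ["A", "A", "B", "C"],
    "Production": [None, None, 0.0, 5.0],
})

# Default sum skips NaN, so the all-missing crop "A" totals 0
totals = toy.groupby("Crop")["Production"].sum()

# With min_count=1, a group with no valid values stays NaN
strict = toy.groupby("Crop")["Production"].sum(min_count=1)

zero_by_default = totals[totals == 0].index.tolist()
zero_recorded = strict[strict == 0].index.tolist()
print(zero_by_default, zero_recorded)
```

Whether any crops in the real list fall into the all-missing case is not something this sketch establishes; it only shows how to tell the two cases apart.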

In [31]:
Crop_df=Crop_df[Crop_df['Production']!=0].reset_index(drop=True)
data_df1=Crop_df.copy()
data_df1.head()

Unnamed: 0,State_Name,District_Name,Crop_Year,Season,Crop,Area,Production,Zone,Crop_Category,Production/Area
0,Andaman_and_Nicobar_Islands,NICOBARS,2000,Kharif,Arecanut,1254.0,2000.0,ut_india,Others,1.59
1,Andaman_and_Nicobar_Islands,NICOBARS,2000,Kharif,Other_Kharif_pulses,2.0,1.0,ut_india,Others,0.5
2,Andaman_and_Nicobar_Islands,NICOBARS,2000,Kharif,Rice,102.0,321.0,ut_india,Cereals,3.15
3,Andaman_and_Nicobar_Islands,NICOBARS,2000,Whole_Year,Banana,176.0,641.0,ut_india,Fruits,3.64
4,Andaman_and_Nicobar_Islands,NICOBARS,2000,Whole_Year,Cashewnut,720.0,165.0,ut_india,Nuts,0.23


In [32]:
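# Exclude 1997 and 2015, the two years with the fewest records in the value counts above
years_to_drop=[1997,2015]
Crop_df=Crop_df[~Crop_df["Crop_Year"].isin(years_to_drop)].reset_index(drop=True)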
Crop_df.head()

Unnamed: 0,State_Name,District_Name,Crop_Year,Season,Crop,Area,Production,Zone,Crop_Category,Production/Area
0,Andaman_and_Nicobar_Islands,NICOBARS,2000,Kharif,Arecanut,1254.0,2000.0,ut_india,Others,1.59
1,Andaman_and_Nicobar_Islands,NICOBARS,2000,Kharif,Other_Kharif_pulses,2.0,1.0,ut_india,Others,0.5
2,Andaman_and_Nicobar_Islands,NICOBARS,2000,Kharif,Rice,102.0,321.0,ut_india,Cereals,3.15
3,Andaman_and_Nicobar_Islands,NICOBARS,2000,Whole_Year,Banana,176.0,641.0,ut_india,Fruits,3.64
4,Andaman_and_Nicobar_Islands,NICOBARS,2000,Whole_Year,Cashewnut,720.0,165.0,ut_india,Nuts,0.23


In [33]:
# crop_limit=[]
# for i in Crop_df['Crop'].unique():
#     data=Crop_df[Crop_df['Crop']==i]['Production/Area']
#     Q1=data.quantile(0.25)
#     Q3=data.quantile(0.75)
#     IQR=Q3-Q1
#     low=Q1-(IQR*3)
#     upp=Q3+(IQR*3)
#     crop_limit.append({"Crop_name": i,"lower_limit": low,"upper_limit": upp})

In [34]:
# crop_limit=pd.DataFrame(crop_limit)
# crop_limit

In [35]:
# Crop_df2 = pd.DataFrame()
# for crop in Crop_df['Crop'].unique():
#     lower_limit = crop_limit[crop_limit['Crop_name'] == crop]["lower_limit"].values[0]
#     upper_limit = crop_limit[crop_limit['Crop_name'] == crop]["upper_limit"].values[0]
#     filtered_data = Crop_df[(Crop_df["Crop"] == crop) & (Crop_df["Production/Area"] >= lower_limit) & (Crop_df["Production/Area"] <= upper_limit)]
#     Crop_df2 = pd.concat([Crop_df2, filtered_data])

# Crop_df2.reset_index(drop=True, inplace=True)
# Crop_df2
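The commented-out cells above compute per-crop limits of Q1 - 3*IQR and Q3 + 3*IQR on `Production/Area` and then filter crop by crop. The same idea can be sketched without explicit loops using `groupby(...).transform`. The function name `drop_crop_outliers`, its parameters, and the toy data are assumptions for illustration, not the notebook's own code.

```python
import pandas as pd

def drop_crop_outliers(frame, value_col="Production/Area", group_col="Crop", k=3.0):
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR] of their own group."""
    grouped = frame.groupby(group_col)[value_col]
    # transform broadcasts each group's quantile back onto that group's rows
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    keep = frame[value_col].between(q1 - k * iqr, q3 + k * iqr)
    return frame[keep].reset_index(drop=True)

# Made-up data: crop "A" contains one extreme value (100)
sample = pd.DataFrame({
    "Crop": ["A"] * 5 + ["B"] * 4,
    "Production/Area": [1, 2, 3, 4, 100, 10, 11, 12, 13],
})
filtered = drop_crop_outliers(sample)
print(filtered)
```

Unlike the loop-and-concatenate version, this keeps the rows in their original order instead of regrouping them by crop.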


In [36]:
# Crop_df3 = pd.DataFrame()
# for crop in data_df1['Crop'].unique():
#     filtered_crop_limit = crop_limit[crop_limit['Crop_name'] == crop]
#     if not filtered_crop_limit.empty:
#         lower_limit = filtered_crop_limit["lower_limit"].values[0]
#         upper_limit = filtered_crop_limit["upper_limit"].values[0]
#         filtered_data = data_df1[(data_df1["Crop"] == crop) & (data_df1["Production/Area"] >= lower_limit) & (data_df1["Production/Area"] <= upper_limit)]
#         Crop_df3 = pd.concat([Crop_df3, filtered_data])
        
# Crop_df3.reset_index(drop=True, inplace=True)
# Crop_df3


In [37]:
# Crop_year_counts=Crop_df.groupby('Crop')['Crop_Year'].nunique().reset_index()
# Crop_year_counts.columns = ['Crop', 'Unique_Years_Count']
# Crop_min_years_10 = Crop_year_counts[Crop_year_counts['Unique_Years_Count'] > 10].reset_index(drop=True)

# ten_years_crops=Crop_min_years_10.Crop.values
# min_ten_years_crops=pd.DataFrame(ten_years_crops,columns=["crops"])
# min_ten_years_crops.head()

In [38]:
# Save crop_df (dropping values from 1997 and 2015 and dropping production values 0)
Crop_df.to_csv('Crop_data_filtered.csv', index=False)

# # Save crop_df2 (dropping outliers for each crop from crop_df)
# Crop_df2.to_csv('Crop_data_filtered_outliers_removed.csv', index=False)

# # Save crop_df3 (dropping outliers for each crop from data_df1)
# Crop_df3.to_csv('Crop_data_missing_values_dropped_outliers_removed.csv', index=False)



In [39]:
# Crop_dataset_names=[]
# for i in Crop_df3["Crop"].unique():
#     data=Crop_df3[Crop_df3["Crop"]==i].reset_index(drop=True)
#     globals()["df_"+i]=data
#     Crop_dataset_names.append("df_"+i)
    

In [40]:
# Season_dataset_names=[]
# for i in Crop_df3["Season"].unique():
#     data=Crop_df3[Crop_df3["Season"]==i].reset_index(drop=True)
#     globals()["df_"+i]=data
#     Season_dataset_names.append("df_"+i)

In [41]:
# State_Name_dataset_names=[]
# for i in Crop_df3["State_Name"].unique():
#     data=Crop_df3[Crop_df3["State_Name"]==i].reset_index(drop=True)
#     globals()["df_"+i]=data
#     State_Name_dataset_names.append("df_"+i)

In [42]:
# Crop_Year_dataset_names=[]
# for i in Crop_df3["Crop_Year"].unique():
#     data=Crop_df3[Crop_df3["Crop_Year"]==i].reset_index(drop=True)
#     globals()["df_"+str(i)]=data
#     Crop_Year_dataset_names.append("df_"+str(i))

In [43]:
# datasets_names=Crop_dataset_names+Season_dataset_names+State_Name_dataset_names+Crop_Year_dataset_names

In [44]:
# len(datasets_names)

In [45]:
# import os

# if not os.path.exists("datasets"):
#     os.makedirs("datasets")

# for dataset_name in datasets_names:
#     dataset = globals()[dataset_name]
#     dataset.to_csv(f"datasets/{dataset_name}.csv", index=False)
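The commented-out cells above create one variable per crop, season, state, and year through `globals()`. A dictionary keyed by the group value is a possible alternative that avoids generating variable names. The sketch below uses a made-up frame, and `by_season` is a hypothetical name.

```python
import pandas as pd

# Made-up rows standing in for the filtered crop data
toy = pd.DataFrame({
    "Season": ["Kharif", "Rabi", "Kharif"],
    "Crop": ["Rice", "Wheat", "Maize"],
})

# One DataFrame per season, keyed by the season name
by_season = {
    season: part.reset_index(drop=True)
    for season, part in toy.groupby("Season")
}
print(sorted(by_season), len(by_season["Kharif"]))
```

Each subset can then be written out with `to_csv` inside a loop over `by_season.items()` if separate files are still wanted.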


1. `df` = original data
2. `data_df` = rows with missing values dropped
3. `data_df1` = `data_df` with zero-production rows also dropped
4. `Crop_df` = (Crop_data_filtered) zero-production rows and the years 1997 and 2015 dropped
5. `Crop_df2` = (Crop_data_filtered_outliers_removed) per-crop outliers dropped from `Crop_df`
6. `Crop_df3` = (Crop_data_missing_values_dropped_outliers_removed) per-crop outliers dropped from `data_df1`
7. `ten_years_crops` = crops with records in more than 10 distinct years

#### Creating these datasets required a significant amount of storage space, but it let us get the best results from all the experiments conducted.