# 2. Acquire the Data

**[NHRDF](http://nhrdf.org/en-us/)** - This is the website of the National Horticultural Research & Development Foundation, which maintains a database of market arrivals and prices, area and production, and export data for three commodities: garlic, onion and potatoes. We are in luck! It has data from 1996 onwards, and only one form needs to be filled in to get the data in tabular form. Excellent, let's use this. The best starting point for everything that is available is http://nhrdf.org/en-us/DatabaseReports

## Scraping the Data
### Scraping - Manual Form Filling

So let us fill in the form at http://nhrdf.org/en-us/MonthWiseMarketArrivals with the values below and save the resulting page as `MonthWiseMarketArrivalsJan2016.html`, so that we can test our scraping process on it.

- Crop Name: Onion
- Month: Jan
- Market: All
- Year: 2016



## Import package and load html file

In [5]:
import pandas as pd

In [6]:
html = pd.read_html("MonthWiseMarketArrivalsJan2016.html")

## Find all the Tables

In [7]:
# Number of tables
len(html)

5

In [8]:
type(html)

list

## Get the exact table

To read only the table we want, we need to pass a value that identifies it. The `attrs` parameter of `read_html` does this: it takes a dictionary of HTML attributes that the table must match. Here we pass the table's `id` attribute.
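The effect of `attrs` is easiest to see on a tiny page. The sketch below is only an illustration: the inline HTML, the ids `layout` and `prices`, and the variable names are made up for the example and are not taken from the NHRDF page. It assumes an HTML parser that pandas can use (for example `lxml`) is installed.

```python
import io
import pandas as pd

# Hypothetical page with two tables; only the second has the id we want.
PAGE = """
<html><body>
  <table id="layout"><tr><td>menu</td><td>links</td></tr></table>
  <table id="prices">
    <tr><th>Market</th><th>Arrival (q)</th></tr>
    <tr><td>AGRA(UP)</td><td>134200</td></tr>
    <tr><td>AJMER(RAJ)</td><td>4247</td></tr>
  </table>
</body></html>
"""

# Without attrs, read_html returns every table it finds on the page.
all_tables = pd.read_html(io.StringIO(PAGE))

# With attrs, only tables whose attributes match are returned.
one_table = pd.read_html(io.StringIO(PAGE), attrs={"id": "prices"})

print(len(all_tables), len(one_table))
print(one_table[0])
```

Either way the result is a list of dataframes, so the matching table is still picked out with `[0]`.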


In [9]:
# Read only the table whose id matches
Table_1 = pd.read_html('MonthWiseMarketArrivalsJan2016.html', 
                      attrs = {'id' : 'dnn_ctr974_MonthWiseMarketArrivals_GridView1'})

In [10]:
# So how many tables have we got now
len(Table_1)

1

In [11]:
Table_1[0].head()

Unnamed: 0,0,1,2,3,4,5,6
0,Market,Month Name,Year,Arrival (q),Price Minimum (Rs/q),Price Maximum (Rs/q),Modal Price (Rs/q)
1,AGRA(UP),January,2016,134200,1039,1443,1349
2,AHMEDABAD(GUJ),January,2016,198390,646,1224,997
3,AHMEDNAGAR(MS),January,2016,208751,175,1722,1138
4,AJMER(RAJ),January,2016,4247,722,1067,939


However, the header has not been read correctly: the column names have ended up as the first row of data, and the columns are simply numbered. Let us see if we can fix this.

In [12]:
Table_1 = pd.read_html('MonthWiseMarketArrivalsJan2016.html', header=0,
                      attrs = {'id' : 'dnn_ctr974_MonthWiseMarketArrivals_GridView1'})
Table_1[0].head()

Unnamed: 0,Market,Month Name,Year,Arrival (q),Price Minimum (Rs/q),Price Maximum (Rs/q),Modal Price (Rs/q)
0,AGRA(UP),January,2016,134200,1039,1443,1349
1,AHMEDABAD(GUJ),January,2016,198390,646,1224,997
2,AHMEDNAGAR(MS),January,2016,208751,175,1722,1138
3,AJMER(RAJ),January,2016,4247,722,1067,939
4,ALIGARH(UP),January,2016,12350,1219,1298,1257
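To make concrete what `header=0` changes, here is a rough manual equivalent on a toy dataframe shaped like the first result, where the real header sits in row 0. This is a sketch of the idea, not a description of how pandas does it internally; `raw` and `fixed` are illustrative names and only three of the seven columns are reproduced.

```python
import pandas as pd

# Toy frame shaped like the first read_html result: the real header
# is in row 0 and the columns are just numbered 0..2.
raw = pd.DataFrame([
    ["Market", "Year", "Arrival (q)"],
    ["AGRA(UP)", "2016", "134200"],
    ["AJMER(RAJ)", "2016", "4247"],
])

# Drop the header row from the data and use it as the column names.
fixed = raw.iloc[1:].reset_index(drop=True)
fixed.columns = raw.iloc[0].tolist()

# Columns that held header text alongside numbers are strings,
# so convert them back to numbers explicitly.
fixed["Arrival (q)"] = pd.to_numeric(fixed["Arrival (q)"])

print(fixed)
print(fixed.dtypes)
```

Passing `header=0` to `read_html` avoids both manual steps, which is why the notebook uses it.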


## Dataframe 

In [13]:
# By convention, name the dataframe df
df = Table_1[0]

In [14]:
# shape of dataframe
df.shape

(84, 7)

In [15]:
# Get the unique values of Market
pd.unique(df['Market'])

array(['AGRA(UP)', 'AHMEDABAD(GUJ)', 'AHMEDNAGAR(MS)', 'AJMER(RAJ)',
       'ALIGARH(UP)', 'ALWAR(RAJ)', 'AMRITSAR(PB)', 'BALLIA(UP)',
       'BANGALORE', 'BAREILLY(UP)', 'BELGAUM(KNT)', 'BHATINDA(PB)',
       'BHAVNAGAR(GUJ)', 'BHUBNESWER(OR)', 'BIJAPUR(KNT)', 'BURDWAN(WB)',
       'CHAKAN(MS)', 'CHANDIGARH', 'CHANDVAD(MS)', 'CHENNAI',
       'DEESA(GUJ)', 'DEHRADOON(UTT)', 'DELHI', 'DEVALA(MS)',
       'DHAVANGERE(KNT)', 'DHULIA(MS)', 'GONDAL(GUJ)', 'GUWAHATI',
       'HASSAN(KNT)', 'HOSHIARPUR(PB)', 'HUBLI(KNT)', 'HYDERABAD',
       'INDORE(MP)', 'JAIPUR', 'JALANDHAR(PB)', 'JALGAON(MS)', 'JAMMU',
       'JAMNAGAR(GUJ)', 'JODHPUR(RAJ)', 'KALVAN(MS)', 'KANPUR(UP)',
       'KARNAL(HR)', 'KHANNA(PB)', 'KOLHAPUR(MS)', 'KOLKATA',
       'KOPERGAON(MS)', 'KOTA(RAJ)', 'KURNOOL(AP)', 'LASALGAON(MS)',
       'LONAND(MS)', 'LUCKNOW', 'MAHUVA(GUJ)', 'MALEGAON(MS)',
       'MANMAD(MS)', 'MUMBAI', 'NAGPUR', 'NEWASA(MS)', 'NIPHAD(MS)',
       'PALAYAM(KER)', 'PATIALA(PB)', 'PATNA', 'PHALTAN (MS)',

In [16]:
# Change the column names to simpler ones
df.columns = ['market', 'month', 'year', 'quantity', 'priceMin', 'priceMax', 'priceMod']

In [17]:
df.head()

Unnamed: 0,market,month,year,quantity,priceMin,priceMax,priceMod
0,AGRA(UP),January,2016,134200,1039,1443,1349
1,AHMEDABAD(GUJ),January,2016,198390,646,1224,997
2,AHMEDNAGAR(MS),January,2016,208751,175,1722,1138
3,AJMER(RAJ),January,2016,4247,722,1067,939
4,ALIGARH(UP),January,2016,12350,1219,1298,1257
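Assigning a list to `df.columns` relies on the columns arriving in exactly the expected order. A possible alternative, shown here as a sketch rather than as what the notebook does, is `DataFrame.rename` with an explicit old-to-new mapping. The one-row `df_demo` and the name `new_names` are illustrative; the row is copied from the output above.

```python
import pandas as pd

# One row copied from the table above, with the original column names.
df_demo = pd.DataFrame({
    "Market": ["AGRA(UP)"],
    "Month Name": ["January"],
    "Year": [2016],
    "Arrival (q)": [134200],
    "Price Minimum (Rs/q)": [1039],
    "Price Maximum (Rs/q)": [1443],
    "Modal Price (Rs/q)": [1349],
})

# Explicit mapping from each original name to its simpler name.
new_names = {
    "Market": "market",
    "Month Name": "month",
    "Year": "year",
    "Arrival (q)": "quantity",
    "Price Minimum (Rs/q)": "priceMin",
    "Price Maximum (Rs/q)": "priceMax",
    "Modal Price (Rs/q)": "priceMod",
}

# rename returns a new dataframe and leaves df_demo unchanged.
renamed = df_demo.rename(columns=new_names)
print(renamed.columns.tolist())
```

With a mapping, a column that is missing or reordered in the source table does not silently receive the wrong name.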


## Save the dataframe to a csv file

In [18]:
df.to_csv('MonthWiseMarketArrivalsJan2016.html.csv', index = False)
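As a quick sanity check on the save step, the sketch below writes a small dataframe to CSV with `index=False` and reads it back. It uses an in-memory buffer instead of a file so that it runs anywhere; `sample`, `buffer` and `restored` are illustrative names, and the two rows are copied from the output above.

```python
import io
import pandas as pd

# Two rows copied from the table above, with the simplified column names.
sample = pd.DataFrame({
    "market": ["AGRA(UP)", "AJMER(RAJ)"],
    "year": [2016, 2016],
    "quantity": [134200, 4247],
})

# Write to an in-memory buffer; index=False leaves out the row index,
# so the CSV holds only the named columns.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read the CSV back and confirm the columns and values survive.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```

Without `index=False`, the file would gain an extra unnamed first column holding the row numbers.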