<h1>Introduction To Data Science With Pandas and NumPy</h1>
<p><b>Data Science</b> or <b>Data Analytics</b> is the process of analyzing large sets of data points to answer questions about that data. <br><b>Pandas</b> and <b>NumPy</b> are two Python modules that make data science easy and effective to explore.</p>

In [2]:
import pandas as pd

<h2>Data Munging or Data Wrangling</h2><br>
<p>The process of cleaning messy data, or grooming our data for further processing, is called Data Munging or Data Wrangling. We can clean data with the built-in functions and methods available in Pandas, such as fillna().</p>
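<p>As a minimal sketch of what fillna() does, the cell below fills the gaps in a small made-up table. The readings and the "No Event" placeholder are assumptions chosen for illustration; they do not come from weather_data.csv.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical readings with gaps; not taken from weather_data.csv.
raw = pd.DataFrame({
    "day": ["1/1/2017", "1/2/2017", "1/3/2017"],
    "temperature": [32.0, np.nan, 28.0],
    "event": ["Rain", None, "Snow"],
})

# Fill each column with a value that suits it:
# the column mean for temperature, a placeholder label for event.
cleaned = raw.fillna({
    "temperature": raw["temperature"].mean(),
    "event": "No Event",
})
print(cleaned)
```

<p>Passing a dictionary to fillna() lets each column get its own replacement value; whether the mean is a sensible fill depends on the data.</p>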

<h1><u>Data Frames in Pandas</u></h1>
<p>The <b>Data Frame</b> is the main object in Pandas. It represents our data as rows and columns (a tabular form, like an Excel spreadsheet).</p>

In [5]:
df = pd.read_csv("weather_data.csv")

In [6]:
df

,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny


<p>We can also create a Data Frame from a Python dictionary, as shown:</p>

In [7]:
weather_dict = {
    'day': ['1/1/2017','1/2/2017','1/3/2017','1/4/2017','1/5/2017','1/6/2017'],
    'temperature': [32,35,28,24,32,31],
    'windspeed' : [6,7,2,7,4,2],
    'event' : ['Rain','Sunny','Snow','Snow','Rain','Sunny']
}
df = pd.DataFrame(weather_dict)
df

,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny


<p>Observe that both outputs are the same.</p>
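<p>The comparison can also be done in code instead of by eye. The sketch below assumes weather_data.csv has the four columns shown in the output above and uses an in-memory string as a stand-in for the file, so it runs without the CSV on disk.</p>

```python
import io

import pandas as pd

# In-memory stand-in for weather_data.csv (assumed layout).
csv_text = """day,temperature,windspeed,event
1/1/2017,32,6,Rain
1/2/2017,35,7,Sunny
1/3/2017,28,2,Snow
1/4/2017,24,7,Snow
1/5/2017,32,4,Rain
1/6/2017,31,2,Sunny
"""
csv_df = pd.read_csv(io.StringIO(csv_text))

dict_df = pd.DataFrame({
    "day": ["1/1/2017", "1/2/2017", "1/3/2017", "1/4/2017", "1/5/2017", "1/6/2017"],
    "temperature": [32, 35, 28, 24, 32, 31],
    "windspeed": [6, 7, 2, 7, 4, 2],
    "event": ["Rain", "Sunny", "Snow", "Snow", "Rain", "Sunny"],
})

# equals() checks that both frames have the same shape, values and column dtypes.
print(csv_df.equals(dict_df))
```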

In [10]:
df.shape  # number of rows and columns in the DataFrame

(6, 4)

In [11]:
rows, cols = df.shape
print("Number of Rows in our DataFrame is = ", rows)
print("Number of Columns in our DataFrame is = ", cols)

Number of Rows in our DataFrame is =  6
Number of Columns in our DataFrame is =  4


In [13]:
# head() and tail()
# head() --> returns the first 5 rows by default, which is handy when the DataFrame is very large
# tail() --> returns the last 5 rows by default
# for example,
df.head()  # first 5 rows of the DataFrame

,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain


In [15]:
df.tail()  # last 5 rows of the DataFrame

,day,temperature,windspeed,event
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny


In [16]:
# we can also pass the number of rows we want to see to tail() (head() accepts it too)
df.tail(2)

,day,temperature,windspeed,event
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny


In [17]:
# we can also slice the DataFrame object (df here) to get a range of rows, as shown below:
df[1:5]   # the row at position 1 is included, the row at position 5 is not

,day,temperature,windspeed,event
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
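<p>The slice above selects rows by position. A more explicit way to write the same thing is iloc, which is always position-based. The sketch below rebuilds the same weather data so it runs on its own; the variable names are only illustrative.</p>

```python
import pandas as pd

df = pd.DataFrame({
    "day": ["1/1/2017", "1/2/2017", "1/3/2017", "1/4/2017", "1/5/2017", "1/6/2017"],
    "temperature": [32, 35, 28, 24, 32, 31],
    "windspeed": [6, 7, 2, 7, 4, 2],
    "event": ["Rain", "Sunny", "Snow", "Snow", "Rain", "Sunny"],
})

by_slice = df[1:5]          # rows at positions 1, 2, 3 and 4
by_iloc = df.iloc[1:5]      # same rows, explicitly position-based
every_other = df.iloc[::2]  # a step works too: positions 0, 2 and 4

print(by_slice.equals(by_iloc))
print(every_other)
```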


In [18]:
# to get the column names (and so the number of columns), we use df.columns
df.columns

Index(['day', 'temperature', 'windspeed', 'event'], dtype='object')

In [19]:
# to get the data of a particular column in our DataFrame, we use df.col_name
df.day

0    1/1/2017
1    1/2/2017
2    1/3/2017
3    1/4/2017
4    1/5/2017
5    1/6/2017
Name: day, dtype: object

In [20]:
df.windspeed

0    6
1    7
2    2
3    7
4    4
5    2
Name: windspeed, dtype: int64

In [21]:
df.event

0     Rain
1    Sunny
2     Snow
3     Snow
4     Rain
5    Sunny
Name: event, dtype: object

In [23]:
# so we can use df.col_name, or we can use the dictionary-style bracket syntax as shown:
df['temperature']

0    32
1    35
2    28
3    24
4    32
5    31
Name: temperature, dtype: int64

In [25]:
df['event']

0     Rain
1    Sunny
2     Snow
3     Snow
4     Rain
5    Sunny
Name: event, dtype: object
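<p>For the column names in this dataset both styles return the same Series. The bracket syntax is the safer habit, though, because df.col_name only works when the name is a valid Python identifier and does not clash with an existing DataFrame attribute. The column names below are hypothetical and chosen only to show those two cases.</p>

```python
import pandas as pd

# Hypothetical column names chosen to show where df.col_name breaks down.
sample = pd.DataFrame({
    "wind speed": [6, 7, 2],  # contains a space, so sample.wind speed is a syntax error
    "count": [1, 2, 3],       # same name as the DataFrame.count() method
})

print(sample["wind speed"].tolist())  # bracket syntax works with any column name
print(sample["count"].tolist())       # returns the column
print(callable(sample.count))         # attribute access finds the method instead
```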

In [26]:
# to see the type of a column, we use type():
print(type(df['event']))

<class 'pandas.core.series.Series'>
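<p>A single column selected with one pair of brackets is a Series. Passing a list of names, even a list with one name, returns a DataFrame instead. A small sketch on the same weather data:</p>

```python
import pandas as pd

df = pd.DataFrame({
    "day": ["1/1/2017", "1/2/2017", "1/3/2017", "1/4/2017", "1/5/2017", "1/6/2017"],
    "event": ["Rain", "Sunny", "Snow", "Snow", "Rain", "Sunny"],
})

as_series = df["event"]   # single brackets: one-dimensional Series
as_frame = df[["event"]]  # list of names: two-dimensional DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```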


In [27]:
# to select several columns, we pass a list of column names:
df[['day','temperature','windspeed']]

,day,temperature,windspeed
0,1/1/2017,32,6
1,1/2/2017,35,7
2,1/3/2017,28,2
3,1/4/2017,24,7
4,1/5/2017,32,4
5,1/6/2017,31,2


<p>Let us do some <b>Operations</b> on our data.</p>
<p>First of all, let us find the maximum value of <em><i>temperature and windspeed</i></em>.</p>

In [28]:
df['temperature'].max()

35

In [29]:
df['windspeed'].max()

7

<p>Similarly, we can find out the minimum value as follows:</p>

In [32]:
print("Minimum Temperature recorded = ", df['temperature'].min())
print("Minimum Windspeed recorded = ", df['windspeed'].min())

Minimum Temperature recorded =  24
Minimum Windspeed recorded =  2
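<p>max() and min() return the value only. To find out which day that value was recorded on, one option is idxmax() and idxmin(), which return the index label of the first matching row. A sketch on the same weather data:</p>

```python
import pandas as pd

df = pd.DataFrame({
    "day": ["1/1/2017", "1/2/2017", "1/3/2017", "1/4/2017", "1/5/2017", "1/6/2017"],
    "temperature": [32, 35, 28, 24, 32, 31],
    "windspeed": [6, 7, 2, 7, 4, 2],
    "event": ["Rain", "Sunny", "Snow", "Snow", "Rain", "Sunny"],
})

hottest_row = df["temperature"].idxmax()  # index label of the highest temperature
coldest_row = df["temperature"].idxmin()  # index label of the lowest temperature

# loc looks a value up by row label and column name.
hottest_day = df.loc[hottest_row, "day"]
coldest_day = df.loc[coldest_row, "day"]
print("Hottest day:", hottest_day)
print("Coldest day:", coldest_day)
```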


In [33]:
# we can use describe() to get a basic statistical summary of our DataFrame; by default the statistics are computed
# for numeric (integer, float) columns only
df.describe()  # gives the count of non-null values, mean, standard deviation, minimum, maximum, and percentiles (25%, 50%, 75%)

,temperature,windspeed
count,6.0,6.0
mean,30.333333,4.666667
std,3.829708,2.33809
min,24.0,2.0
25%,28.75,2.5
50%,31.5,5.0
75%,32.0,6.75
max,35.0,7.0
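<p>The summary above leaves out day and event because they hold text. Calling describe() on a text column directly gives a different set of statistics, and value_counts() shows the frequencies behind them. A sketch on the same weather data:</p>

```python
import pandas as pd

df = pd.DataFrame({
    "temperature": [32, 35, 28, 24, 32, 31],
    "event": ["Rain", "Sunny", "Snow", "Snow", "Rain", "Sunny"],
})

# On a text column, describe() reports count, unique, top and freq
# instead of the mean and percentiles.
event_summary = df["event"].describe()
print(event_summary)

# value_counts() gives the full frequency table behind "top" and "freq".
event_counts = df["event"].value_counts()
print(event_counts)
```

<p>In this data every event occurs twice, so which value is reported as "top" is a tie-break and should not be read as meaningful.</p>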


In [34]:
# we can also select specific rows based on a condition, for example
# the rows where temperature is greater than the mean temperature:
df[df.temperature > df['temperature'].mean()]

,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny


In [35]:
# similarly, we can select specific columns together with a condition
# for example, the day, event and windspeed when temperature is greater than the mean temperature:
df[['day','event','windspeed']][df.temperature>df['temperature'].mean()]

,day,event,windspeed
0,1/1/2017,Rain,6
1,1/2/2017,Sunny,7
4,1/5/2017,Rain,4
5,1/6/2017,Sunny,2


In [None]:
# observe the outputs of the two operations above and find the similarity between them
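<p>The column-then-condition selection chains two indexing steps. The same result can be written as a single step with loc, which takes the row condition and the column list together. This is offered as an alternative sketch, not as a correction; the name mask is only illustrative.</p>

```python
import pandas as pd

df = pd.DataFrame({
    "day": ["1/1/2017", "1/2/2017", "1/3/2017", "1/4/2017", "1/5/2017", "1/6/2017"],
    "temperature": [32, 35, 28, 24, 32, 31],
    "windspeed": [6, 7, 2, 7, 4, 2],
    "event": ["Rain", "Sunny", "Snow", "Snow", "Rain", "Sunny"],
})

# Boolean Series: True where the temperature is above the mean.
mask = df["temperature"] > df["temperature"].mean()

chained = df[["day", "event", "windspeed"]][mask]        # two indexing steps
with_loc = df.loc[mask, ["day", "event", "windspeed"]]   # rows and columns in one step

print(with_loc)
print(chained.equals(with_loc))
```

<p>Keeping the condition in a named variable also makes it easy to reuse for other selections.</p>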