# Introduction to Data Science

## Introduction to a few basic functions

In [1]:
import numpy as np
import pandas as pd

In [2]:
randn = np.random.rand  # Alias used often below; note that rand draws uniform samples from [0, 1), not normal ones

### Basic Functionality

#### Head and Tail

In [3]:
long_series = pd.DataFrame(randn(1000))  # a one-column DataFrame, despite the name

To view a small sample of a Series or DataFrame object, use the **head()** and **tail()** methods. The default number of elements to display is five, but you may pass a custom number.

In [4]:
long_series.head()

|  | 0 |
|---|---|
| 0 | 0.283956 |
| 1 | 0.301206 |
| 2 | 0.154645 |
| 3 | 0.029701 |
| 4 | 0.67556 |


In [5]:
long_series.tail(6)

|  | 0 |
|---|---|
| 994 | 0.673974 |
| 995 | 0.548096 |
| 996 | 0.027773 |
| 997 | 0.801471 |
| 998 | 0.897445 |
| 999 | 0.528823 |


**Attributes and the raw values**

Pandas objects have a number of attributes that give you access to their metadata:

- shape: gives the axis dimensions of the object, consistent with ndarray
- values: gives the underlying data as a NumPy ndarray

**Axis labels**

- Series: index (only axis)
- DataFrame: index (rows) and columns
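The attributes listed above can be tried on a tiny fixed frame. This is a minimal sketch: the frame `demo` and its labels are hypothetical, and `to_numpy()` is used as the method counterpart of the `values` attribute, assuming a pandas version that provides it (0.24 or later).

```python
import numpy as np
import pandas as pd

# Hypothetical 2x3 frame used only for illustration.
demo = pd.DataFrame(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    index=["r1", "r2"],
    columns=["A", "B", "C"],
)

shape = demo.shape               # (number of rows, number of columns)
row_labels = list(demo.index)    # labels along axis 0
col_labels = list(demo.columns)  # labels along axis 1
raw = demo.to_numpy()            # underlying data as a NumPy ndarray

col_a = demo["A"]                # a Series has a single axis: its index
series_shape = col_a.shape

print(shape, row_labels, col_labels, raw.shape, series_shape)
```

A DataFrame reports a two-element shape, while a Series selected from it reports a one-element shape, which matches the "only axis" wording in the list above.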


In [6]:
df = pd.DataFrame(np.random.randn(10, 4),columns=['A', 'B', 'C', 'D'])
df

|  | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.446542 | 1.753041 | 0.596543 | 0.864337 |
| 1 | -0.253604 | -0.017301 | 1.215406 | 1.094898 |
| 2 | 0.395122 | -1.016751 | 0.726448 | -1.293052 |
| 3 | -1.494603 | -1.935372 | 1.322872 | 1.217552 |
| 4 | -0.382946 | -0.264188 | 1.013857 | 0.364941 |
| 5 | -0.629871 | 0.049954 | 0.742775 | -1.080868 |
| 6 | 0.283878 | -2.546868 | -0.473586 | -0.818646 |
| 7 | 1.109055 | 0.35854 | 0.71631 | -0.45716 |
| 8 | 0.760719 | -0.082272 | -0.132923 | -1.612044 |
| 9 | 0.202033 | -0.572644 | 0.63833 | -0.029501 |


In [7]:
df.shape

(10, 4)

In [8]:
df.values

array([[ 1.44654173,  1.7530407 ,  0.59654337,  0.86433689],
       [-0.25360363, -0.01730117,  1.21540558,  1.09489779],
       [ 0.39512246, -1.01675073,  0.72644814, -1.29305186],
       [-1.49460296, -1.93537236,  1.32287198,  1.21755249],
       [-0.38294649, -0.26418755,  1.01385665,  0.36494105],
       [-0.62987066,  0.049954  ,  0.74277539, -1.08086833],
       [ 0.28387786, -2.54686751, -0.47358634, -0.81864577],
       [ 1.1090553 ,  0.35854001,  0.71631003, -0.45715975],
       [ 0.7607188 , -0.08227156, -0.13292281, -1.61204405],
       [ 0.20203336, -0.57264435,  0.63832969, -0.02950099]])

In [9]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [10]:
df.mean(axis=0)  # axis=0 aggregates down the rows, giving one mean per column

A    0.143633
B   -0.427386
C    0.636603
D   -0.174954
dtype: float64

In [11]:
df.mean(axis=1)  # axis=1 aggregates across the columns, giving one mean per row

0    1.165116
1    0.509850
2   -0.297058
3   -0.222388
4    0.182916
5   -0.229502
6   -0.888805
7    0.431686
8   -0.266630
9    0.059554
dtype: float64
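Because the axis argument is easy to misread, here is a small sketch on a hypothetical two-row frame (`demo` is not part of the notebook) whose means can be checked by hand.

```python
import pandas as pd

# Hypothetical frame with values chosen so the means are easy to verify.
demo = pd.DataFrame({"A": [1.0, 3.0], "B": [10.0, 30.0]})

col_means = demo.mean(axis=0)  # collapse the rows: one mean per column
row_means = demo.mean(axis=1)  # collapse the columns: one mean per row

print(col_means)
print(row_means)
```

The result of `axis=0` is indexed by the column labels, and the result of `axis=1` is indexed by the row labels, which is a quick way to tell the two apart.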

**Summary of data**

There is a convenient ***describe()*** function which computes a variety of summary statistics about a Series or the columns of a DataFrame.


The ***idxmin()*** and ***idxmax()*** functions on Series and DataFrame compute the index labels with the minimum and maximum corresponding values:

**Note:** *When there are multiple rows (or columns) matching the minimum or maximum value, idxmin() and idxmax() return the first matching index.*
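The tie-breaking rule in the note can be illustrated with a short sketch. The Series `s` is hypothetical and is built so that both extremes occur twice; the boolean-mask lines are one possible way to recover every matching label, not something the notebook itself uses.

```python
import pandas as pd

# Hypothetical Series in which the minimum (1) and the maximum (7) each occur twice.
s = pd.Series([7, 1, 7, 1], index=["a", "b", "c", "d"])

first_min = s.idxmin()  # label of the first occurrence of the minimum
first_max = s.idxmax()  # label of the first occurrence of the maximum

# Boolean masks recover every label that attains the extreme value.
all_min = s[s == s.min()].index.tolist()
all_max = s[s == s.max()].index.tolist()

print(first_min, first_max, all_min, all_max)
```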

In [12]:
df.describe()

|  | A | B | C | D |
|---|---|---|---|---|
| count | 10.0 | 10.0 | 10.0 | 10.0 |
| mean | 0.143633 | -0.427386 | 0.636603 | -0.174954 |
| std | 0.870118 | 1.204085 | 0.557803 | 1.031417 |
| min | -1.494603 | -2.546868 | -0.473586 | -1.612044 |
| 25% | -0.350611 | -0.905724 | 0.60699 | -1.015313 |
| 50% | 0.242956 | -0.17323 | 0.721379 | -0.24333 |
| 75% | 0.66932 | 0.03314 | 0.946086 | 0.739488 |
| max | 1.446542 | 1.753041 | 1.322872 | 1.217552 |


In [13]:
df.idxmin()

A    3
B    6
C    6
D    8
dtype: int64

In [14]:
df.idxmax()

A    0
B    0
C    3
D    3
dtype: int64

#### Value counts or Histogramming

The ***value_counts()*** Series method computes a histogram of a 1D array of values.

Let's draw 60 random integers from 0 to 5 (the upper bound 6 passed to randint is exclusive), wrap them in a Series, and compute the histogram.


In [15]:
pd.Series(np.random.randint(0,6,size=60)).value_counts()

2    16
1    15
4    10
5     9
3     7
0     3
dtype: int64

In [16]:
# This draws a fresh random sample, so the mode need not match the counts above
pd.Series(np.random.randint(0,6,size=60)).mode()

0    5
dtype: int64
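Each of the two cells above draws its own random sample, so the reported mode and the histogram describe different data. The sketch below uses a hypothetical fixed sample (`rolls`) so that the counts and the mode can be compared on the same Series; `normalize=True` and `sort_index()` are optional extras shown for illustration.

```python
import pandas as pd

# Hypothetical fixed sample, so the counts are reproducible.
rolls = pd.Series([2, 1, 2, 4, 2, 1, 5])

counts = rolls.value_counts()                # ordered by frequency, most common first
shares = rolls.value_counts(normalize=True)  # relative frequencies instead of counts
by_value = counts.sort_index()               # reorder the histogram by the value itself
modes = rolls.mode()                         # Series holding the most frequent value(s)

print(counts)
print(by_value)
print(modes.tolist())
```

Note that `mode()` returns a Series rather than a scalar, because several values can share the highest count.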

### Sorting by index and value

There are two obvious kinds of sorting that you may be interested in: sorting by label and sorting by actual values. The primary method for sorting axis labels (indexes) across data structures is the **sort_index()** method.

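Older pandas releases let **DataFrame.sort_index()** accept an optional by argument for ***axis=0***, which used an arbitrary vector or a column name of the DataFrame to determine the sort order. Current releases no longer accept it; use **sort_values(by=...)** instead.

In short: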
- sort_values is meant to sort by the values of columns
- sort_index is meant to sort by the index labels (or a specific level of the index, or the column labels when axis=1)


Let's take an easier dataset.

In [17]:
df = pd.DataFrame(np.random.randint(6,size=(6,3)),columns = ['First', 'Second', 'Third'],
                 index=['Zero','One','Two','Three','Four','Five'])

In [18]:
df.sort_index(ascending=False)

|  | First | Second | Third |
|---|---|---|---|
| Zero | 4 | 4 | 2 |
| Two | 4 | 0 | 3 |
| Three | 1 | 5 | 3 |
| One | 4 | 0 | 1 |
| Four | 4 | 4 | 4 |
| Five | 3 | 4 | 4 |


In [19]:
df.sort_index(ascending=False,axis=1)

|  | Third | Second | First |
|---|---|---|---|
| Zero | 2 | 4 | 4 |
| One | 1 | 0 | 4 |
| Two | 3 | 0 | 4 |
| Three | 3 | 5 | 1 |
| Four | 4 | 4 | 4 |
| Five | 4 | 4 | 3 |


In [20]:
# Sorting by values inside the DataFrame

df.sort_values(by = ['First','Second'])

|  | First | Second | Third |
|---|---|---|---|
| Three | 1 | 5 | 3 |
| Five | 3 | 4 | 4 |
| One | 4 | 0 | 1 |
| Two | 4 | 0 | 3 |
| Zero | 4 | 4 | 2 |
| Four | 4 | 4 | 4 |
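Since the frame above is random, the following sketch repeats the three kinds of sorting on a hypothetical fixed frame (`demo`) whose result can be checked by hand. Passing a list to `ascending` to mix sort directions per column is an addition not used in the cells above.

```python
import pandas as pd

# Hypothetical fixed frame with deliberately unsorted row labels.
demo = pd.DataFrame(
    {"First": [4, 1, 4], "Second": [0, 5, 4]},
    index=["b", "c", "a"],
)

# Rows ordered by index label.
by_label = demo.sort_index()

# Columns ordered by column label, descending.
by_column_label = demo.sort_index(axis=1, ascending=False)

# Rows ordered by values: "First" descending, ties broken by "Second" ascending.
by_value = demo.sort_values(by=["First", "Second"], ascending=[False, True])

print(by_label)
print(by_column_label)
print(by_value)
```

None of these calls modifies `demo`; each returns a new, sorted frame.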


### Indexing and selecting data

#### Different Choices for Indexing

Pandas has offered three types of multi-axis indexing; the third, **.ix**, is deprecated (see the warning below).

- **.loc** is primarily label based, but may also be used with a boolean array. **.loc** will raise KeyError when the items are not found. Allowed inputs are:

 - A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index, not as an integer position along the index)
 - A list or array of labels ['a', 'b', 'c']
 - A slice object with labels 'a':'f' (note that, contrary to usual Python slices, both the start and the stop are included!)
 - A boolean array

- **.iloc** is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array. **.iloc** will raise IndexError if a requested indexer is out-of-bounds, except slice indexers which allow out-of-bounds indexing. Allowed inputs are:

 - An integer e.g. 5
 - A list or array of integers [4, 3, 0]
 - A slice object with ints 1:7
 - A boolean array

- **.ix** supports mixed integer and label based access. It is primarily label based, but will fall back to integer positional access unless the corresponding axis is of integer type. **.ix** is the most general and will support any of the inputs in **.loc** and **.iloc**. **.ix** also supports floating point label schemes. **.ix** is exceptionally useful when dealing with mixed positional and label based hierarchical indexes.

However, when an axis is integer based, ONLY label based access and not positional access is supported. Thus, in such cases, it’s usually better to be explicit and use .iloc or .loc.
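The point about integer-based axes is easiest to see when the integer labels do not coincide with the positions. The Series `s` below is hypothetical; the sketch shows that `.loc` treats an integer as a label while `.iloc` treats it as a position.

```python
import pandas as pd

# Hypothetical Series whose integer labels (10, 20, 30) differ from positions (0, 1, 2).
s = pd.Series(["x", "y", "z"], index=[10, 20, 30])

by_label = s.loc[10]     # label lookup
by_position = s.iloc[0]  # position lookup

# 0 is a valid position but not a label, so .loc raises KeyError.
try:
    s.loc[0]
    missing_label = False
except KeyError:
    missing_label = True

print(by_label, by_position, missing_label)
```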

**Warning:** Starting in 0.20.0, the .ix indexer is deprecated in favor of the stricter .iloc and .loc indexers; it was removed entirely in pandas 1.0.

In [21]:
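# df.loc[2] would raise a KeyError here: 2 is a position, not a label of this index
# Select rows and columns in a single .loc call rather than chaining two lookups
df.loc[['One','Three'], ['Third']]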

|  | Third |
|---|---|
| One | 1 |
| Three | 3 |


In [22]:
df.iloc[1:3,0:2]

|  | First | Second |
|---|---|---|
| One | 4 | 0 |
| Two | 4 | 0 |


In [23]:
df.iloc[[4,3,5],0:2]

|  | First | Second |
|---|---|---|
| Four | 4 | 4 |
| Three | 1 | 5 |
| Five | 3 | 4 |


In [24]:
df.loc['One':'Three','Second':'Third']

|  | Second | Third |
|---|---|---|
| One | 0 | 1 |
| Two | 0 | 3 |
| Three | 5 | 3 |
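The slicing and error rules described for **.loc** and **.iloc** can be checked side by side. The frame `demo` below is a hypothetical fixed stand-in for the random one used above, and the label 'Nine' is chosen only because it is absent from the index.

```python
import pandas as pd

# Hypothetical fixed frame with string row labels.
demo = pd.DataFrame(
    {"First": [4, 4, 4, 1], "Second": [4, 0, 0, 5]},
    index=["Zero", "One", "Two", "Three"],
)

label_slice = demo.loc["One":"Three"]  # label slice: both endpoints included
position_slice = demo.iloc[1:3]        # position slice: stop excluded
beyond_end = demo.iloc[2:100]          # out-of-bounds slice is truncated, no error

# A single out-of-bounds position raises IndexError.
try:
    demo.iloc[100]
    raised_index_error = False
except IndexError:
    raised_index_error = True

# A label that is absent from the index raises KeyError.
try:
    demo.loc["Nine"]
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(len(label_slice), len(position_slice), len(beyond_end))
```

The label slice returns three rows while the positional slice over the same starting row returns two, which is the inclusive-versus-exclusive difference noted in the list of allowed inputs.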


#### More functions can be learnt after importing the dataset