# Introduction to Data Science

## Introduction to a few Data Structures

We will look at two of the data structures that pandas provides: Series and DataFrame.
We start by importing NumPy and pandas under their conventional short names:

In [1]:
import numpy as np
import pandas as pd

In [2]:
randn = np.random.rand  # Uniform samples in [0, 1), despite the name; we will use it often

### Series

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. 

The basic call looks like this:

> series = pd.Series(data, index=index)

The first argument, `data`, can be:

- array-like
- dictionary
- scalar

#### Array-like

In [3]:
series = pd.Series(randn(5), index=['a', 'b', 'c', 'd', 'e'])
series

a    0.615294
b    0.745649
c    0.711277
d    0.199578
e    0.385222
dtype: float64

In [4]:
series.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

**Note :** If data is an array-like, index must be the same length as data. If no index is passed, one will be created having values [0, ..., len(data) - 1].
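As a minimal sketch of the note above (the names `values`, `default_indexed`, and `error_message` are illustrative and not part of the original notebook), omitting the index yields integer labels, while an index of the wrong length raises a `ValueError`:

```python
import numpy as np
import pandas as pd

values = np.array([10.0, 20.0, 30.0])

# No index passed: pandas creates integer labels 0 .. len(data) - 1.
default_indexed = pd.Series(values)
print(list(default_indexed.index))  # [0, 1, 2]

# An index whose length differs from the data is rejected.
try:
    pd.Series(values, index=['a', 'b'])
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

The exact wording of the error message may vary between pandas versions, so the sketch only prints it rather than relying on its text.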

#### Dictionary

In [5]:
dictionary = {'a': 0, 'b': 1, 'c': 2}
series = pd.Series(dictionary)
series

a    0
b    1
c    2
dtype: int64

#### Scalar

In [6]:
series = pd.Series(5, index=['a', 'b', 'c', 'd', 'e'])
series

a    5
b    5
c    5
d    5
e    5
dtype: int64

### Similarity of Series to arrays and dictionaries

In [8]:
series = pd.Series(randn(5), index=['a', 'b', 'c', 'd', 'e'])
series

a    0.075563
b    0.135883
c    0.420993
d    0.525824
e    0.249337
dtype: float64

In [9]:
# Array-like: use .iloc for explicit positional access
print(series.iloc[0], '\n')
print(series.iloc[:2], '\n')
print(series.iloc[[3, 2, 4]], '\n')
print(series[series > series.mean()], '\n')

0.0755625561122 

a    0.075563
b    0.135883
dtype: float64 

d    0.525824
c    0.420993
e    0.249337
dtype: float64 

c    0.420993
d    0.525824
dtype: float64 



**Note :** The index is sliced along with the values and always remains part of the data container.
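A small sketch of this behaviour, using made-up values and the illustrative name `subset`: a positional slice keeps the matching labels, so the result can still be accessed by label.

```python
import pandas as pd

series = pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd'])

# Positional slice: the labels 'b' and 'c' travel with the values.
subset = series.iloc[1:3]
print(list(subset.index))  # ['b', 'c']

# The sliced Series can still be queried by label.
print(subset['c'])  # 3.0
```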

In [10]:
# Dictionary like
print(series['b'],'\n')
print('e' in series)

0.135882847198 

True


### DataFrame

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. Like Series, DataFrame accepts many different kinds of input:

- Dict of 1D ndarrays, lists, dicts, or Series
- 2-D numpy.ndarray
- A Series
- Another DataFrame
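The cells that follow cover the dict-based inputs. As a hedged sketch of the other three kinds of input in the list above (the names `array_2d`, `from_array`, `from_series`, and `from_frame` are illustrative, not from the original notebook):

```python
import numpy as np
import pandas as pd

# From a 2-D ndarray: rows get default integer labels;
# column labels are supplied here through `columns`.
array_2d = np.array([[1, 2], [3, 4], [5, 6]])
from_array = pd.DataFrame(array_2d, columns=['x', 'y'])

# From a Series: the result has a single column named after the Series.
from_series = pd.DataFrame(pd.Series([1.0, 2.0], index=['a', 'b'], name='one'))

# From another DataFrame: copy=True gives the new object its own data.
from_frame = pd.DataFrame(from_array, copy=True)

print(from_array.shape)               # (3, 2)
print(list(from_series.columns))      # ['one']
print(from_frame.equals(from_array))  # True
```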

In [11]:
# From a dict of Series

d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
df

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0


In [12]:
pd.DataFrame(d, index=['d', 'b', 'a'])

   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0


In [13]:
df.index, df.columns

(Index(['a', 'b', 'c', 'd'], dtype='object'),
 Index(['one', 'two'], dtype='object'))

In [14]:
# From dict of array-likes
d = {'one' : [1., 2., 3., 4.], 'two' : [4., 3., 2., 1.]}
pd.DataFrame(d)

   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0


In [15]:
# From a list of dicts

data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
pd.DataFrame(data)

   a   b     c
0  1   2   NaN
1  5  10  20.0


In [16]:
pd.DataFrame(data, index=['first', 'second'])

        a   b     c
first   1   2   NaN
second  5  10  20.0


### Basic Functionality

#### Head and Tail

In [17]:
long_series = pd.DataFrame(randn(1000))  # A 1000-row, single-column DataFrame, despite the name

To view a small sample of a Series or DataFrame object, use the **head()** and **tail()** methods. The default number of elements to display is five, but you may pass a custom number.

In [18]:
long_series.head()

          0
0  0.699867
1  0.741703
2  0.859944
3  0.852917
4  0.457070


In [19]:
long_series.tail(6)

            0
994  0.135869
995  0.378166
996  0.086060
997  0.666109
998  0.796589
999  0.908928
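
The cells above call these methods on a DataFrame. As a minimal sketch with deterministic data (the names `numbers`, `first_five`, and `last_three` are illustrative), the same calls work on a Series:

```python
import numpy as np
import pandas as pd

numbers = pd.Series(np.arange(100))

first_five = numbers.head()   # default: first 5 elements
last_three = numbers.tail(3)  # custom count: last 3 elements

print(first_five.tolist())  # [0, 1, 2, 3, 4]
print(last_three.tolist())  # [97, 98, 99]
```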
