# Objective: Understanding Series & DataFrame
**Topics**
* The **Series** Data Structure
    * Creating series
    * Querying series **(loc / iloc)**
    * Some basic series operations
* The **DataFrame** data structure
    * Creating DataFrame
    * Accessing data in DataFrames
    * Dropping data (**.drop(), del**)
    * Adding a new column
    * Appending one DF to another

# **The Series Data Structure**
* The Series is one of the core data structures in pandas
* A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index
* It can be thought of as a cross between a list and a dictionary
* Syntax for creating a Series: **pd.Series(data, index=index)**; data can be, for example, a list, NumPy array, dict, or scalar value, and index is a list of axis labels
* When data is a list or array, the index must have the same length as the data
* If no index is passed, an index of **0 to n-1** will be assigned
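
As a small illustration of the last two points, the sketch below (with made-up values; the names `default_s` and `mismatch_raised` are only for this example) builds a Series without an index and then shows that an index of the wrong length is rejected.

```python
import pandas as pd

# No index passed: pandas assigns the default integer labels 0 .. n-1.
default_s = pd.Series([10, 20, 30])
print(list(default_s.index))  # [0, 1, 2]

# An index whose length differs from the data raises a ValueError.
try:
    pd.Series([10, 20, 30], index=['a', 'b'])
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)  # True
```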

In [9]:
import numpy as np
# Import Library Pandas
import pandas as pd  
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s

a   -0.518971
b    0.768881
c    1.771436
d    0.992556
e    1.074037
dtype: float64

In [10]:
s.index

Index([u'a', u'b', u'c', u'd', u'e'], dtype='object')

In [11]:
# Series from a Python dict: the keys automatically become the index
dict1 = {'1': 'apple',
          '2': 'book',
          '3': 'cat',
          '4': 'dog'}
s = pd.Series(dict1)
s

1    apple
2     book
3      cat
4      dog
dtype: object

In [12]:
s.index

Index([u'1', u'2', u'3', u'4'], dtype='object')

In [13]:
# if an index label has no matching key in the dict, its value is set to NaN
s1 = pd.Series(dict1, index = ['1','2','5'])
s1

1    apple
2     book
5      NaN
dtype: object

**Querying Series**
* A pandas Series can be queried, either by the index position or the index label.
* If no index is provided, the position and the label are effectively the same values
* To query by **numeric location**, starting at zero, we use the **iloc** attribute
* To query by the **index label**, you can use the **loc** attribute
* A Series with an explicit index can be queried both ways
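
The difference between the two attributes is easiest to see when the labels are integers that do not match the positions. A minimal sketch, assuming an illustrative Series named `letters`:

```python
import pandas as pd

# Integer labels that deliberately differ from the positions 0, 1, 2.
letters = pd.Series(['a', 'b', 'c'], index=[10, 20, 30])

by_label = letters.loc[10]          # label 10 is the first element
by_position = letters.iloc[0]       # position 0 is also the first element
last_by_position = letters.iloc[2]  # position 2 is the element labeled 30

print(by_label, by_position, last_by_position)  # a a c
```

Here `letters.loc[0]` would raise a `KeyError`, because no label `0` exists even though position `0` does.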

In [14]:
# query by numeric location
dict1 = {'one': 'apple',
          'two': 'book',
          'three': 'cat',
          'four': 'dog'}
s = pd.Series(dict1)
print(s.iloc[0])
print(s.iloc[1])
print(s.iloc[2])
print(s.iloc[3])
# a single observation can also be queried by position without iloc;
# querying a list of observations this way may fail, so prefer iloc
print(s[0])
print(s[1])
print(s[2])
print(s[3])

dog
apple
cat
book
dog
apple
cat
book


In [15]:
s.iloc[[0,1,2]]

four       dog
one      apple
three      cat
dtype: object

In [16]:
#query by the index label
s.loc['one']

'apple'

In [17]:
s.loc[['one','two']]

one    apple
two     book
dtype: object

In [18]:
# adding a new element to series, type doesn't matter
s.loc['five'] = 'elephant'
s
s.loc['1']= 5.90
s

four          dog
one         apple
three         cat
two          book
five     elephant
1             5.9
dtype: object

In [19]:
# series with duplicate keys
s1 = pd.Series(['a', 'A', 'B','b'], index = ['1','1','2','2'])
s1

1    a
1    A
2    B
2    b
dtype: object

In [20]:
s1.loc['1']

1    a
1    A
dtype: object

In [21]:
#Basic Series operations
#this creates a big series of random numbers
s = pd.Series(np.random.randint(0,1000,10000))
print(s.head())   # shows top 5 observations
len(s)            # gives length of s
total = np.sum(s) # sum of the series using numpy sum function
s +=2             # adds two to each item in s using broadcasting
s.head(3)

0     84
1    521
2    563
3    383
4    700
dtype: int32


0     86
1    523
2    565
dtype: int32

In [22]:
# appending one series to another
dict1 = {'one': 'apple',
          'two': 'book',
          'three': 'cat',
          'four': 'dog'}
s = pd.Series(dict1)
s2 = s.append(s1)
s2

four       dog
one      apple
three      cat
two       book
1            a
1            A
2            B
2            b
dtype: object
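
Note that `Series.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the cell above only runs on older versions. On recent versions `pd.concat` produces the same kind of result. A minimal sketch with illustrative values (`words`, `chars`, and `combined` are names chosen for this example):

```python
import pandas as pd

words = pd.Series({'one': 'apple', 'two': 'book'})
chars = pd.Series(['a', 'A'], index=['1', '1'])

# pd.concat takes a list of Series and stacks them, keeping every original label.
combined = pd.concat([words, chars])
print(combined)
```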

# **The DataFrame data structure**
* The primary object of the pandas library
* The DataFrame is conceptually a two-dimensional Series object, with an index and multiple columns of content, each column having a label
* Index and column labels may be non-unique
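
A common way to build a DataFrame is from a dict that maps column labels to values. A minimal sketch with made-up data (the frame `inventory` and its labels are hypothetical):

```python
import pandas as pd

# Each dict key becomes a column label; the index list labels the rows.
inventory = pd.DataFrame(
    {'item': ['apple', 'book', 'cat'],
     'qty': [3, 5, 1]},
    index=['r1', 'r2', 'r3'])

print(inventory.shape)          # (3, 2)
print(list(inventory.columns))  # ['item', 'qty']

# Each column is itself a Series that shares the DataFrame's row index.
qty_column = inventory['qty']
print(type(qty_column).__name__)  # Series
```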

In [41]:
# creating a DataFrame from multiple Series; each Series becomes one row
s10 = pd.Series(np.random.randint(0,10,5), index = ['one','two','three','four','five'])
s11 = pd.Series(np.random.randint(0,100,5), index = ['one','two','three','four','five'])
# randint's upper bound is exclusive, so randint(0,1,5) always yields zeros
s12 = pd.Series(np.random.randint(0,1,5), index = ['one','two','three','four','five'])
df = pd.DataFrame([s10, s11, s12], index=['random1', 'random2', 'random3'])
df.head()

         one  two  three  four  five
random1    1    7      8     1     8
random2   77    2     54    22    37
random3    0    0      0     0     0


In [24]:
df.loc['random1']

one      0
two      4
three    1
four     1
five     8
Name: random1, dtype: int32

**Accessing data in DataFrames**

In [25]:
# filtering using labels
df.loc['random1', 'two']

4

In [26]:
# filtering using integer positions
df.iloc[0,0]

0

In [27]:
# selecting multiple rows in single column
df.loc[['random1','random2']]['one']  # chaining should be avoided: it can give unpredictable results because pandas may return a copy
                                      # of the data instead of a view

random1     0
random2    20
Name: one, dtype: int32
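
The chained lookup above can be written as a single `.loc` call that names the rows and the column together. A minimal sketch on a small stand-in frame (`df_demo` and its values are assumptions for this example):

```python
import pandas as pd

df_demo = pd.DataFrame({'one': [0, 20, 0], 'two': [4, 87, 0]},
                       index=['random1', 'random2', 'random3'])

# Rows and column selected in one .loc call, with no intermediate object.
subset = df_demo.loc[['random1', 'random2'], 'one']
print(subset.tolist())  # [0, 20]

# Assignment through a single .loc call modifies df_demo itself.
df_demo.loc[['random1', 'random2'], 'one'] = -1
print(df_demo['one'].tolist())  # [-1, -1, 0]
```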

In [28]:
# selecting all rows and multiple columns; with loc, column selection is label based
df.loc[:,['one','two']]

         one  two
random1    0    4
random2   20   87
random3    0    0


In [42]:
df

         one  two  three  four  five
random1    1    7      8     1     8
random2   77    2     54    22    37
random3    0    0      0     0     0


In [43]:
# returns a boolean Series that is True for the rows where the condition holds
df['one']>0

random1     True
random2     True
random3    False
Name: one, dtype: bool

In [50]:
# number of non-NaN values in column 'one' (here equal to the number of rows)
df['one'].count()

3

In [52]:
# where() keeps the rows where the condition is True and fills the others with NaN
gt_zero = df.where(df['one'] > 0)
gt_zero

          one  two  three  four  five
random1   1.0  7.0    8.0   1.0   8.0
random2  77.0  2.0   54.0  22.0  37.0
random3   NaN  NaN    NaN   NaN   NaN


In [53]:
gt_zero['one'].count()

2

In [54]:
# to remove the rows that don't satisfy the condition (dropna drops rows containing NaN)
gt_zero = gt_zero.dropna()
gt_zero

          one  two  three  four  five
random1   1.0  7.0    8.0   1.0   8.0
random2  77.0  2.0   54.0  22.0  37.0
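
The `where()` plus `dropna()` steps above can also be expressed as plain boolean indexing, which never introduces NaN and therefore leaves the integer columns as integers. A minimal sketch on a stand-in frame (`df_demo` and `filtered` are illustrative names):

```python
import pandas as pd

df_demo = pd.DataFrame({'one': [1, 77, 0], 'two': [7, 2, 0]},
                       index=['random1', 'random2', 'random3'])

# Boolean indexing keeps only the rows where the mask is True.
filtered = df_demo[df_demo['one'] > 0]
print(filtered)

# No NaN was inserted, so every column still has an integer dtype.
print(filtered.dtypes)
```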


In [57]:
# multiple conditions: combine with | (or) and & (and), wrapping each condition in parentheses
len(df[(df['one']>0) | (df['two']>10)])

2

In [62]:
multi_con = df[(df['three'] > 0) & (df['four']> 5)]
multi_con

         one  two  three  four  five
random2   77    2     54    22    37


**Dropping data**
* Syntax: DataFrame.drop(**labels**, **axis=0**, level=None, **inplace=False**, errors='raise')
* **inplace=True** modifies the original DataFrame directly instead of returning a modified copy
* **axis=0** drops rows and **axis=1** drops columns; the default is 0
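
A minimal sketch of these parameters on a throwaway frame (the names `df_demo`, `r1`, and `r2` are illustrative only):

```python
import pandas as pd

df_demo = pd.DataFrame({'one': [1, 2], 'two': [3, 4]}, index=['r1', 'r2'])

without_row = df_demo.drop('r1')           # axis=0 (default): drop a row
without_col = df_demo.drop('one', axis=1)  # axis=1: drop a column

print(without_row.index.tolist())    # ['r2']
print(without_col.columns.tolist())  # ['two']
print(df_demo.shape)                 # (2, 2): the original is unchanged
```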

In [29]:
# dropping rows 
df.drop('random1')

         one  two  three  four  five
random2   20   87     90    43    68
random3    0    0      0     0     0


In [30]:
df # no change in df: .drop() does not remove the row from the original data but returns a copy
   # without that row. To keep the result, assign it to a new variable or pass inplace=True

         one  two  three  four  five
random1    0    4      1     1     8
random2   20   87     90    43    68
random3    0    0      0     0     0


In [31]:
df.drop('random1', inplace = True)

In [32]:
df

         one  two  three  four  five
random2   20   87     90    43    68
random3    0    0      0     0     0


In [33]:
df.drop('one', inplace = True, axis = 1)

In [34]:
df

         two  three  four  five
random2   87     90    43    68
random3    0      0     0     0


In [35]:
# drop column using del
df_copy = df.copy() # work on a copy so that df itself is untouched
del df_copy['two']  # del removes the column in place and returns nothing
df_copy

         three  four  five
random2     90    43    68
random3      0     0     0


**Appending a new column**

In [36]:
df['six']= None
df

         two  three  four  five   six
random2   87     90    43    68  None
random3    0      0     0     0  None


In [37]:
mylist = [7,7]
df['seven']= mylist
df

         two  three  four  five   six  seven
random2   87     90    43    68  None      7
random3    0      0     0     0  None      7


In [38]:
# numpy broadcasting 
df['seven'] *=2
df

         two  three  four  five   six  seven
random2   87     90    43    68  None     14
random3    0      0     0     0  None     14
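
The same assignment syntax also accepts a scalar, which is repeated for every row, or an expression built from existing columns. A minimal sketch on a stand-in frame (the column names `flag` and `total` are made up for this example):

```python
import pandas as pd

df_demo = pd.DataFrame({'two': [87, 0], 'three': [90, 0]},
                       index=['random2', 'random3'])

df_demo['flag'] = 1                                    # scalar repeated for every row
df_demo['total'] = df_demo['two'] + df_demo['three']   # element-wise sum of two columns
print(df_demo)
```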


**Appending Two DataFrames**

In [65]:
df1 = pd.DataFrame([[1, 2], [3, 4]], columns=['A','B'])
df1

   A  B
0  1  2
1  3  4


In [66]:
df2 = pd.DataFrame([[3, 4], [5, 6]], columns=['A','B'])
df2

   A  B
0  3  4
1  5  6


In [68]:
# DataFrame.append(DF, ignore_index=False, verify_integrity=False)
df3 = df1.append(df2)
df3    # notice that each frame's original index is preserved

   A  B
0  1  2
1  3  4
0  3  4
1  5  6


In [70]:
# with ignore_index=True
df4 = df1.append(df2,ignore_index=True)
df4

   A  B
0  1  2
1  3  4
2  3  4
3  5  6
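
`DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the two cells above require an older version. On recent versions the same results can be obtained with `pd.concat`. A minimal sketch reusing the `df1` and `df2` definitions from above (`kept` and `renumbered` are names chosen for this example):

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
df2 = pd.DataFrame([[3, 4], [5, 6]], columns=['A', 'B'])

# Default: each frame keeps its own row labels, so 0 and 1 appear twice.
kept = pd.concat([df1, df2])
print(kept.index.tolist())        # [0, 1, 0, 1]

# ignore_index=True discards the old labels and renumbers from 0.
renumbered = pd.concat([df1, df2], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```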
