<p><img alt="udeA logo" height="150px" src="https://github.com/freddyduitama/images/blob/master/logo.png?raw=true" align="left" hspace="50px" vspace="0px" style="width:107px;height:152px;"></p>
<h1><font color='0B5345'> <center>
Pandas </center></font></h1>
<h2><font color='0B5345'> <center>
General Review</center></font></h2>
<font  face="Courier New" size="3">
<p1><center> Juliana Moreno Rada</center></p1>



<p3><center><b><font color='0B5345' face="Lucida Calligraphy,Comic Sans MS,Lucida Console" size="5">Universidad de Antioquia</font></b> </center></p3>

Pandas is a package built on top of NumPy, and provides an efficient implementation of a DataFrame.

**DataFrames** are multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data.

As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.

Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices.

In [1]:
import pandas as pd

In [2]:
import numpy as np

**Pandas Series Object**

A Pandas Series is a one-dimensional array of indexed data.

In [4]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

The Series wraps both a sequence of values and a sequence of indices, which we can access with the values and index attributes.
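As a small sketch of those two attributes (the variable names `s`, `values`, and `index` are illustrative and not part of the original notebook), the following rebuilds the same Series and pulls out its data and its labels. When no index is given, pandas supplies a default integer range.

```python
import pandas as pd

# Same values as the Series used in this section
s = pd.Series([0.25, 0.5, 0.75, 1.0])

values = s.values   # the underlying NumPy array holding the data
index = s.index     # the labels; a default RangeIndex in this case

print(type(values).__name__)
print(index)
```

The index is a pandas object in its own right, which is what allows it to hold labels other than consecutive integers.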

In [5]:
print(data)

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64


In [10]:
data[1]

0.5

In [6]:
data.values
#type(data.values)

array([0.25, 0.5 , 0.75, 1.  ])

In [7]:
data[1:3] #data can be accessed by the associated index

1    0.50
2    0.75
dtype: float64

The essential difference between a NumPy array and a Series is the presence of the index.

While the NumPy array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values.

In [8]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd']) #The index can be strings; noncontiguous or nonsequential integers work too
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [9]:
data['b']

0.5

**Series as specialized dictionary**

A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure that maps typed keys to a set of typed values.

This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it much more efficient than Python dictionaries for certain operations.

In [11]:
population_dict = {'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135}

In [12]:
population_dict

{'California': 38332521,
 'Texas': 26448193,
 'New York': 19651127,
 'Florida': 19552860,
 'Illinois': 12882135}

In [13]:
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64
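To illustrate the point about typed values, here is a minimal sketch that contrasts a per-key Python loop over the dictionary with a single vectorized operation on the Series. The names `pop_millions_dict` and `pop_millions` are assumptions introduced only for this example, and no timing claim is made here.

```python
import pandas as pd

population_dict = {'California': 38332521, 'Texas': 26448193,
                   'New York': 19651127, 'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)

# Dictionary: every value is handled one at a time in Python
pop_millions_dict = {state: n / 1e6 for state, n in population_dict.items()}

# Series: one vectorized operation over the typed int64 values
pop_millions = population / 1e6

print(pop_millions.round(2))
```

Both approaches give the same numbers; the Series version keeps the state labels attached and leaves the loop to compiled code.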

Unlike a dictionary, though, the Series also supports array-style operations such as slicing:

In [15]:
population['California':'New York']

California    38332521
Texas         26448193
New York      19651127
dtype: int64

In [16]:
pd.Series(5, index=[100, 200, 300])

100    5
200    5
300    5
dtype: int64

In [17]:
pd.Series({2:'a', 1:'b', 3:'c'})

2    a
1    b
3    c
dtype: object

In [18]:
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])

3    c
2    a
dtype: object

### **Pandas DataFrame Object**

If a Series is an analog of a one-dimensional array with flexible indices, a **DataFrame** is an analog of a two-dimensional array with both flexible row indices and flexible column names.

Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a **DataFrame** as a sequence of aligned Series objects.

In [19]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297, 'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

In [25]:
population_dict = {'antioquia':452,'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135}
population = pd.Series(population_dict)
population

antioquia          452
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [23]:
datos = {'Mes':['Enero', 'Febrero', 'Marzo', 'Abril'], 'Ventas':[30500, 35600, 28300, 33900],
         'Gastos':[22000, 23400, 18100, 20700]}

contabilidad = pd.DataFrame(datos)
contabilidad

       Mes  Ventas  Gastos
0    Enero   30500   22000
1  Febrero   35600   23400
2    Marzo   28300   18100
3    Abril   33900   20700


In [22]:
datos

{'Mes': ['Enero', 'Febrero', 'Marzo', 'Abril'],
 'Ventas': [30500, 35600, 28300, 33900],
 'Gastos': [22000, 23400, 18100, 20700]}

In [26]:
states = pd.DataFrame({'population': population, 'area': area})
states

            population      area
California    38332521  423967.0
Florida       19552860  170312.0
Illinois      12882135  149995.0
New York      19651127  141297.0
Texas         26448193  695662.0
antioquia          452       NaN


In [27]:
states.index

Index(['California', 'Florida', 'Illinois', 'New York', 'Texas', 'antioquia'], dtype='object')

In [28]:
states.columns

Index(['population', 'area'], dtype='object')

We can also think of a DataFrame as a specialization of a dictionary. Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data.
For example, asking for the 'area' attribute returns a Series object:


In [29]:
states['area']

California    423967.0
Florida       170312.0
Illinois      149995.0
New York      141297.0
Texas         695662.0
antioquia          NaN
Name: area, dtype: float64
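The dictionary analogy can be pushed a little further. The sketch below, using a reduced two-state version of the data (an assumption made to keep it short), shows that membership tests, `keys()`, and `items()` all operate on column names rather than row labels.

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662})
population = pd.Series({'California': 38332521, 'Texas': 26448193})
states = pd.DataFrame({'population': population, 'area': area})

# Membership and keys() refer to column names, like dictionary keys
has_area = 'area' in states          # 'area' is a column
has_texas = 'Texas' in states        # 'Texas' is a row label, not a key
column_names = list(states.keys())

# Each "value" is the Series holding that column
for name, column in states.items():
    print(name, type(column).__name__)
```

This is why `states['area']` works while `states['Texas']` raises a `KeyError`: plain bracket indexing looks up columns.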

In [30]:
data = [{'a': i, 'b': 2 * i} for i in range(3)]
print(data)
pd.DataFrame(data)


[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}]


   a  b
0  0  0
1  1  2
2  2  4


In [31]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}]) #Even if some keys in the dictionary are missing, Pandas will fill them in with NaN

     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0


In [32]:
pd.DataFrame(np.random.rand(3, 2), columns=['foo', 'bar'], index=['a', 'b', 'c'])

        foo       bar
a  0.745460  0.455903
b  0.071847  0.965284
c  0.311774  0.479629


In [33]:
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
pd.DataFrame(A)

   A    B
0  0  0.0
1  0  0.0
2  0  0.0


The Index object that labels Series and DataFrame objects is an interesting structure in itself, and it can be thought of either as an immutable array or as an ordered set (technically a multiset, as Index objects may contain repeated values).

In [35]:
ind = pd.Index([2, 3, 5, 7, 11])
ind[::2]

Int64Index([2, 5, 11], dtype='int64')

In [36]:
print(ind.size, ind.shape, ind.ndim, ind.dtype)

5 (5,) 1 int64


One difference between Index objects and NumPy arrays is that indices are immutable: they cannot be modified via the normal means.

In [38]:
#ind[1]
ind[1] = 0 #ERROR

TypeError: Index does not support mutable operations
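The "ordered set" view of an Index mentioned above can be sketched with the set-style methods that `pd.Index` provides. The two indices below are made-up values chosen only for illustration.

```python
import pandas as pd

ind_a = pd.Index([1, 3, 5, 7, 9])
ind_b = pd.Index([2, 3, 5, 7, 11])

common = ind_a.intersection(ind_b)            # labels present in both
combined = ind_a.union(ind_b)                 # labels present in either
only_one = ind_a.symmetric_difference(ind_b)  # labels present in exactly one

print(list(common), list(combined), list(only_one))
```

Because an Index is immutable, each of these calls returns a new Index rather than changing `ind_a` or `ind_b`.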

In [39]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [40]:
'a' in data

True

In [42]:
data['e'] = 1.25
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

In [43]:
data.keys()

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [44]:
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0), ('e', 1.25)]

In [45]:
data['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64

In [46]:
data[0:2] #Implicit index

a    0.25
b    0.50
dtype: float64

In [47]:
#Masking
data[(data > 0.3) & (data < 0.8)]

b    0.50
c    0.75
dtype: float64

In [48]:
#Fancy indexing
data[['a', 'e']]

a    0.25
e    1.25
dtype: float64

**Indexers: loc and iloc**

Pandas provides some special indexer attributes that explicitly expose certain indexing schemes:

* The loc attribute allows indexing and slicing that always references the explicit index
* The iloc attribute allows indexing and slicing that always references the implicit Python-style index
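The distinction matters most when the explicit index is itself made of integers. The following sketch uses a small hypothetical Series (not from the original notebook) whose labels are 1, 3, 5, so that label-based and position-based access give visibly different answers.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

by_label = s.loc[1]            # label 1 is the first element
by_position = s.iloc[1]        # position 1 is the second element

label_slice = s.loc[1:3]       # label slices include the end label
position_slice = s.iloc[1:3]   # position slices exclude the end position

print(by_label, by_position)
print(list(label_slice), list(position_slice))
```

Using `loc` and `iloc` explicitly avoids having to remember which convention plain bracket indexing applies in a given case.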

In [49]:
area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297, 'Florida': 170312, 'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135


In [51]:
data['density'] = data['pop'] / data['area']
data

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


In [52]:
data.iloc[:3, :2]

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127


In [53]:
data.loc[data.density > 100, ['pop', 'area']] #loc can combine masking (rows) with fancy indexing (columns)

               pop    area
New York  19651127  141297
Florida   19552860  170312


In [55]:
data.iloc[0, 2] = 9000
data

              area       pop      density
California  423967  38332521  9000.000000
Texas       695662  26448193    38.018740
New York    141297  19651127   139.076746
Florida     170312  19552860   114.806121
Illinois    149995  12882135    85.883763


In [56]:
data['Florida':'Illinois']

            area       pop     density
Florida   170312  19552860  114.806121
Illinois  149995  12882135   85.883763


In [57]:
data[data.density > 100]

              area       pop      density
California  423967  38332521  9000.000000
New York    141297  19651127   139.076746
Florida     170312  19552860   114.806121


**OPERATING ON DATA IN PANDAS**

One of the essential pieces of NumPy is the ability to perform quick element-wise operations, both with basic arithmetic (addition, subtraction, multiplication, etc.) and with more sophisticated operations (trigonometric functions, exponential and logarithmic functions, etc.).

Pandas inherits much of this functionality from NumPy: NumPy's universal functions (ufuncs) work directly on Series and DataFrame objects and preserve their index and column labels.

In [58]:
rng = np.random.RandomState(42)
df = pd.DataFrame(rng.randint(0, 10, (3, 4)), columns=['A', 'B', 'C', 'D'])
df

   A  B  C  D
0  6  3  7  4
1  6  9  2  6
2  7  4  3  7


In [62]:
a=np.sin(df * np.pi / 4)
a

          A             B         C             D
0 -1.000000  7.071068e-01 -0.707107  1.224647e-16
1 -1.000000  7.071068e-01  1.000000 -1.000000e+00
2 -0.707107  1.224647e-16  0.707107 -7.071068e-01


In [60]:
df

   A  B  C  D
0  6  3  7  4
1  6  9  2  6
2  7  4  3  7


**Ufuncs Index Alignment:**  For binary operations on two Series or DataFrame objects, Pandas will align indices in the process of performing the operation.

In [63]:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662, 'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127}, name='population')

In [64]:
area

Alaska        1723337
Texas          695662
California     423967
Name: area, dtype: int64

In [65]:
population

California    38332521
Texas         26448193
New York      19651127
Name: population, dtype: int64

In [66]:
population / area

Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64

In [67]:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B

0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64

In [68]:
A.add(B, fill_value=0)

0    2.0
1    5.0
2    9.0
3    5.0
dtype: float64
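The same alignment rule applies to DataFrames, where both the row index and the column names are aligned. The sketch below uses two small hypothetical frames (`left` and `right`, with made-up values) that overlap in only one cell.

```python
import pandas as pd

left = pd.DataFrame({'x': [1, 2], 'y': [3, 4]}, index=[0, 1])
right = pd.DataFrame({'y': [10, 20], 'z': [30, 40]}, index=[1, 2])

# Plain addition: only the cell present in both frames gets a value
total = left + right

# add() with fill_value: a cell missing from one frame is treated as 0;
# a cell missing from both frames stays NaN
filled = left.add(right, fill_value=0)

print(total)
print(filled)
```

The result covers the union of row labels (0, 1, 2) and column names (x, y, z), so the order of the inputs does not affect which cells are present.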

**HANDLING MISSING DATA**

The difference between data found in many tutorials and data in the real world is that real-world data is rarely clean and homogeneous.

**Trade-Offs in Missing Data Conventions**

A number of schemes have been developed to indicate the presence of missing data in a table or DataFrame. Generally, they revolve around one of two strategies: using a mask that globally indicates missing values, or choosing a sentinel value that indicates a missing entry.
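Both strategies can be seen side by side in pandas itself. This is a sketch that goes slightly beyond the notebook: it assumes pandas 1.0 or later, where the nullable `"Int64"` extension dtype is available and tracks missing entries separately from the data.

```python
import numpy as np
import pandas as pd

# Sentinel approach: NaN marks the missing entry, which forces float64
sentinel = pd.Series([1, np.nan, 3])

# Mask approach: the nullable "Int64" dtype keeps the values as integers
# and records which entries are missing (displayed as <NA>)
masked = pd.Series([1, None, 3], dtype="Int64")

print(sentinel.dtype, masked.dtype)
print(masked.isna().tolist())
```

The sentinel version costs nothing extra in storage but changes the dtype; the masked version preserves integers at the price of carrying an additional Boolean array.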

The first sentinel value used by Pandas is None, a Python singleton object that is often used for missing data in Python code.

Because None is a Python object, it cannot be used in any arbitrary NumPy/Pandas array, but only in arrays with data type 'object' (i.e., arrays of Python objects).

In [69]:
vals1 = np.array([1, None, 3, 4])
vals1 #dtype=object: the best common type NumPy could infer for these contents is Python objects

array([1, None, 3, 4], dtype=object)

The other missing data representation, NaN (acronym for Not a Number), is different; it is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation.

In [70]:
vals2 = np.array([1, np.nan, 3, 4])
vals2.dtype

dtype('float64')

Notice that NumPy chose a native floating-point type for this array: this means that unlike the object array from before, this array supports fast operations pushed into compiled code.
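The practical consequence of the two representations shows up as soon as the arrays are aggregated. The sketch below reuses `vals1` and `vals2` from the cells above; the flag `object_sum_failed` is a name introduced only for this example.

```python
import numpy as np

vals1 = np.array([1, None, 3, 4])     # dtype=object, contains None
vals2 = np.array([1, np.nan, 3, 4])   # dtype=float64, contains NaN

# Summing the object array fails, because int + None is undefined
try:
    vals1.sum()
    object_sum_failed = False
except TypeError:
    object_sum_failed = True

# NaN propagates through ordinary arithmetic and aggregations...
plain_sum = vals2.sum()

# ...unless a NaN-aware function is used to skip it
safe_sum = np.nansum(vals2)

print(object_sum_failed, plain_sum, safe_sum)
```

NaN never raises an error here; it silently turns the result into NaN, so NaN-aware functions such as `np.nansum` are needed when missing values should be ignored.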

In [71]:
pd.Series([1, np.nan, 2, None])

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

**Operating on Null Values**

* isnull(): Generate a Boolean mask indicating missing values

* notnull(): Opposite of isnull()

* dropna(): Return a filtered version of the data

* fillna(): Return a copy of the data with missing values filled or imputed

In [72]:
data = pd.Series([1, np.nan, 'hello', None])
data

0        1
1      NaN
2    hello
3     None
dtype: object

In [73]:
data.isnull()

0    False
1     True
2    False
3     True
dtype: bool

In [74]:
a=data[data.notnull()]

In [78]:
data

0        1
1      NaN
2    hello
3     None
dtype: object

In [76]:
a

0        1
2    hello
dtype: object

In [83]:
a=data.dropna()

In [82]:
data

0        1
1      NaN
2    hello
3     None
dtype: object

For a DataFrame, there are more options. Consider the following DataFrame:

In [84]:
df = pd.DataFrame([[1, np.nan, 2], [2, 3, 5], [np.nan, 4, 6]])
df

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


We cannot drop single values from a DataFrame; we can only drop full rows or full columns.

By default, dropna() will drop all rows in which any null value is present:

In [85]:
df.dropna()

     0    1  2
1  2.0  3.0  5


In [86]:
df[3] = np.nan
df.dropna(axis='columns', how='all')

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6
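Between the default `how='any'` and the `how='all'` used above, `dropna()` also accepts a `thresh` parameter giving the minimum number of non-null values a row or column must have to be kept. A minimal sketch, reusing the same DataFrame with its all-NaN column 3:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2], [2, 3, 5], [np.nan, 4, 6]])
df[3] = np.nan  # a column made entirely of missing values

# Default (how='any'): every row has a NaN in column 3, so all are dropped
all_rows_dropped = df.dropna()

# thresh=3: keep only rows with at least three non-null values
kept = df.dropna(axis='index', thresh=3)

print(len(all_rows_dropped))
print(kept)
```

Only row 1 survives the threshold, since it is the only row with three valid entries.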


Sometimes rather than dropping NA values, you’d rather replace them with a valid value. This value might be a single number like zero, or it might be some sort of imputation or interpolation from the good values.

In [90]:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
b=data.fillna(0)

In [91]:
b

a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64
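Filling with zero is only one choice. As a sketch of the imputation and interpolation options mentioned earlier, the same Series can be filled with its mean or with linearly interpolated values; the names `mean_filled` and `interpolated` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

# Imputation: replace missing entries with the mean of the valid ones
mean_filled = data.fillna(data.mean())

# Interpolation: estimate missing entries from their neighbours
# (the default linear method treats the values as equally spaced)
interpolated = data.interpolate()

print(mean_filled.tolist())
print(interpolated.tolist())
```

Which option is appropriate depends on the data: the mean ignores ordering, while interpolation assumes neighbouring entries are related.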

In [None]:
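#fillna(method='ffill') and fillna(method='bfill') are deprecated in recent pandas;
#use the dedicated ffill() and bfill() methods instead
#data.bfill() #back fill: propagate the next valid value backward

data.ffill() #forward fill: propagate the previous valid value forward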