# Introduction to Pandas

## About Pandas


**Dimension & Description**



<table class="table table-bordered">
<tbody><tr>
<th style="text-align:center;">Data Structure</th>
<th style="text-align:center;">Dimensions</th>
<th style="text-align:center;">Description</th>
</tr>
<tr>
<td style="text-align:center;">Series</td>
<td style="text-align:center;">1</td>
<td style="text-align:center;">1D labeled homogeneous array, size-immutable.</td>
</tr>
<tr>
<td style="text-align:center;">DataFrame</td>
<td style="text-align:center;">2</td>
<td style="text-align:center;">General 2D labeled, size-mutable tabular structure with potentially heterogeneously typed
columns.</td>
</tr>
<tr>
<td style="text-align:center;">Panel</td>
<td style="text-align:center;">3</td>
<td style="text-align:center;">General 3D labeled, size-mutable array (deprecated since pandas 0.20 and removed in 0.25).</td>
</tr>
</tbody></table>


## Importing pandas

In [2]:
import pandas as pd

Check the version:

In [2]:
pd.__version__

'0.24.2'

## Reminder about Built-In Documentation



For example, to display all the contents of the pandas namespace, you can type

In [None]:
pd.<TAB>

And to display Pandas's built-in documentation, you can use this:

In [3]:
pd?

More detailed documentation, along with tutorials and other resources, can be found at http://pandas.pydata.org/.

## Introducing Pandas Objects



In [4]:
import numpy as np
import pandas as pd

### The Pandas Series Object


In [6]:
data = pd.Series([0.25,0.5,0.75,1.0])

In [7]:
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [8]:
data[1]

0.5

#### Series as generalized NumPy array


#### Series as specialized dictionary

In [55]:
# Two dictionaries keyed by state name; each one becomes a Series
population_dict = {'California': 383352256,
                   'Texas': 245578987,
                   'New York': 789635412,
                   'Florida': 123365478,
                   'Illinois': 658896234}
population = pd.Series(population_dict)

area_dict = {'California': 123456789,
             'Texas': 321654987,
             'New York': 564897231,
             'Florida': 456789123,
             'Illinois': 963852741}
area = pd.Series(area_dict)

# From a dictionary of Series objects (rows are aligned on the shared index)
states = pd.DataFrame({'population': population, 'area': area})
# Inspect with: type(states), states.index, states.values, states.columns

# From a single Series object
pd.DataFrame(population, columns=['population'])

# From a two-dimensional NumPy array
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

# From a list of dicts; keys missing from a dict are filled with NaN
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])



     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0


In [63]:
ind = pd.Index([2, 3, 5, 7, 12])
ind[1]

3

#### Constructing Series objects

`>>> pd.Series(data, index=index)`
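As a minimal sketch of the constructor above: `data` can be a list, a scalar, or a dictionary. The values and labels below are made up for illustration.

```python
import pandas as pd

# From a list, with an explicit (non-default) index
s_list = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# From a scalar, repeated to fill the index
s_scalar = pd.Series(5, index=[100, 200, 300])

# From a dictionary; the keys become the index
s_dict = pd.Series({2: 'a', 1: 'b', 3: 'c'})

# Passing index alongside a dict keeps only the requested keys, in that order
s_subset = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2])

print(s_list['b'])           # label-based lookup
print(list(s_subset.index))
```

In each case the `index` argument is optional; without it a list gets the default integer index and a dictionary uses its keys.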


[Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html)

### The Pandas DataFrame Object



#### DataFrame as specialized dictionary


#### Constructing DataFrame objects



- **From a single Series object**



- **From a dictionary of Series objects**
- **From a two-dimensional NumPy array**



- **From a list of dicts**



[Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)

### The Pandas Index Object



#### Index as immutable array
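A short sketch of what "immutable array" means in practice, reusing the index values from the earlier cell. Reads behave like NumPy array access, while item assignment is expected to raise a `TypeError`.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 12])

# Array-style access works as it does for NumPy arrays
second = ind[1]
every_other = ind[::2]

# Item assignment is rejected because an Index cannot be modified in place
try:
    ind[1] = 0
    mutated = True
except TypeError:
    mutated = False

print(second, list(every_other), mutated)
```

This immutability is what makes it safe to share one index between several Series and DataFrames.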



## Importing Data with Pandas

## Introducing DataFrames



<p><img alt="anatomy of a dataframe" src="https://s3.amazonaws.com/dq-content/291/df_anatomy.svg"></p>



## Pandas Data Selection - indexing


### Selecting pandas data using “loc” (Selecting Columns From a DataFrame by Label)



<p><img alt="loc single column" src="https://s3.amazonaws.com/dq-content/291/loc_single.svg"></p>




<p><img alt="loc list of columns" src="https://s3.amazonaws.com/dq-content/291/loc_list.svg"></p>


<p><img alt="loc slice of columns" src="https://s3.amazonaws.com/dq-content/291/loc_slice.svg"></p>


<center>
<table>
<thead>
<tr>
<th>Select by Label</th>
<th>Explicit Syntax</th>
<th>Common Shorthand</th>
<th>Other Shorthand</th>
</tr>
</thead>
<tbody>
<tr>
<td>Single column</td>
<td><code>df.loc[:,"col1"]</code></td>
<td><code>df["col1"]</code></td>
<td><code>df.col1</code></td>
</tr>
<tr>
<td>List of columns</td>
<td><code>df.loc[:,["col1", "col7"]]</code></td>
<td><code>df[["col1", "col7"]]</code></td>
<td></td>
</tr>
<tr>
<td>Slice of columns</td>
<td><code>df.loc[:,"col1":"col4"]</code></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
</center>
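The selection forms in the table can be tried on a small frame. This is a sketch with a hypothetical DataFrame whose columns are named `col1` to `col4`; the row labels are also made up.

```python
import pandas as pd

# Hypothetical frame with columns col1..col4
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4],
                   'col3': [5, 6], 'col4': [7, 8]},
                  index=['x', 'y'])

single = df.loc[:, 'col1']               # one column, returned as a Series
same_single = df['col1']                 # common shorthand for the same thing
several = df.loc[:, ['col1', 'col4']]    # list of labels, returns a DataFrame
sliced = df.loc[:, 'col1':'col3']        # label slices include the end label

print(list(sliced.columns))
```

Note that, unlike Python list slicing, a label slice with `loc` includes its end point.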




### Selecting Items from a Series by Label



<p><img alt="dataframe exploded" src="https://s3.amazonaws.com/dq-content/291/df_exploded.svg"></p>



<p><img alt="series vs dataframe: series" src="https://s3.amazonaws.com/dq-content/291/df_series_s.svg"></p>


<p><img alt="series vs dataframe: dataframe" src="https://s3.amazonaws.com/dq-content/291/df_series_df.svg"></p>



<center>
<table>
<thead>
<tr>
<th></th>
<th>Series</th>
<th>DataFrame</th>
</tr>
</thead>
<tbody>
<tr>
<th>Dimensions</th>
<td>One</td>
<td>Two</td>
</tr>
<tr>
<th>Has 'index' axis</th>
<td>Yes</td>
<td>Yes</td>
</tr>
<tr>
<th>Has 'columns' axis</th>
<td>No</td>
<td>Yes</td>
</tr>
<tr>
<th>Number of dtypes</th>
<td>One</td>
<td>Many (one per column)</td>
</tr>
</tbody>
</table>
</center>
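Because a Series has only an index axis, `loc` takes a single label argument. A minimal sketch with made-up labels:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

one = s.loc['b']            # single label, returns a scalar
some = s.loc[['a', 'd']]    # list of labels, returns a Series
span = s.loc['b':'d']       # label slice, end label included

print(one, list(some), list(span))
```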



### Selecting Rows From a DataFrame by Label



<p><img alt="anatomy of a dataframe" src="https://s3.amazonaws.com/dq-content/291/df_anatomy_static.svg"></p>



**pandas.DataFrame.set_index**

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html?highlight=set_index#pandas.DataFrame.set_index

`DataFrame.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)`

Set the DataFrame index (row labels) using one or more existing columns. By default yields a new object.

<table>
<colgroup><col class="field-name">
<col class="field-body">
</colgroup><tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><dl class="first docutils">
<dt><strong>keys</strong> <span class="classifier-delimiter">:</span> <span class="classifier">column label or list of column labels / arrays</span></dt>
<dd></dd>
</dl>
<p><strong>drop</strong> : boolean, default True</p>
<blockquote>
<div><p>Delete columns to be used as the new index</p>
</div></blockquote>
<p><strong>append</strong> : boolean, default False</p>
<blockquote>
<div><p>Whether to append columns to existing index</p>
</div></blockquote>
<p><strong>inplace</strong> : boolean, default False</p>
<blockquote>
<div><p>Modify the DataFrame in place (do not create a new object)</p>
</div></blockquote>
<p><strong>verify_integrity</strong> : boolean, default False</p>
<blockquote>
<div><p>Check the new index for duplicates. Otherwise defer the check until
necessary. Setting to False will improve the performance of this
method</p>
</div></blockquote>
</td>
</tr>
<tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body"><dl class="first last docutils">
<dt><strong>dataframe</strong> <span class="classifier-delimiter">:</span> <span class="classifier">DataFrame</span></dt>
<dd></dd>
</dl>
</td>
</tr>
</tbody>
</table>
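The following sketch shows `set_index` followed by row selection with `loc`. The frame and its column names (`name`, `rank`, `score`) are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b', 'c'],
                   'rank': [1, 2, 3],
                   'score': [9.5, 8.0, 7.25]})

# drop=True (the default) removes 'name' from the columns
by_name = df.set_index('name')

# drop=False keeps the column as well as using it for the index
kept = df.set_index('name', drop=False)

row = by_name.loc['b']            # one row, returned as a Series
rows = by_name.loc[['a', 'c']]    # several rows, returned as a DataFrame

print(row['rank'], list(rows.index))
```

Since `inplace` defaults to `False`, `df` itself is left unchanged and the result has to be assigned to a name.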

### Selecting pandas data using “iloc”
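Where `loc` selects by label, `iloc` selects by integer position. A minimal sketch on a hypothetical 3x3 frame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]},
                  index=['x', 'y', 'z'])

first_row = df.iloc[0]        # row at position 0, returned as a Series
cell = df.iloc[1, 2]          # row position 1, column position 2
block = df.iloc[0:2, 0:2]     # positional slices exclude the end position
last_col = df.iloc[:, -1]     # negative positions count from the end

print(cell, block.shape)
```

Positional slices follow the usual Python convention and exclude the end position, which is the opposite of label slices with `loc`.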


## Series and Dataframe Describe Methods



### More Data Exploration Methods



<p><img alt="dataframe axis parameters" src="https://s3.amazonaws.com/dq-content/291/axis_param.svg"></p>
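A small sketch of `describe()` and of how the `axis` parameter changes the direction of a calculation. The data is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [10, 20, 30, 40]})

summary = df.describe()      # count, mean, std, min, quartiles, max per column
col_max = df.max(axis=0)     # axis=0: one result per column
row_sum = df.sum(axis=1)     # axis=1: one result per row

# value_counts() is a Series method that tallies each distinct value
counts = pd.Series(['x', 'y', 'x']).value_counts()

print(summary.loc['mean', 'a'], list(row_sum))
```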



### Assignment with pandas

### Using Boolean Indexing with pandas Objects



<p><img alt="Boolean arrays in pandas" src="https://s3.amazonaws.com/dq-content/291/boolean_array_pandas.svg"></p>



<p><img alt="example dataframe" src="https://s3.amazonaws.com/dq-content/291/eg_df.svg"></p>



<p><img alt="boolean series" src="https://s3.amazonaws.com/dq-content/291/bool_series.svg"></p>



<p><img alt="boolean indexing dataframe" src="https://s3.amazonaws.com/dq-content/291/boolean_indexing_df.svg"></p>



<p><img alt="boolean indexing series" src="https://s3.amazonaws.com/dq-content/291/boolean_indexing_s.svg"></p>
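The figures above can be reproduced roughly as follows. The frame is a hypothetical stand-in (the names and ages are invented), and the comparison `== 8` is only an example condition.

```python
import pandas as pd

df = pd.DataFrame({'name': ['p1', 'p2', 'p3', 'p4'],
                   'age': [12, 8, 5, 8]})

# A comparison on a column yields a boolean Series with the same index
mask = df['age'] == 8

matches = df[mask]                # rows where the mask is True
names = df.loc[mask, 'name']      # same rows, restricted to one column

print(list(mask), list(names))
```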




### Using Boolean Arrays to Assign Values
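A boolean mask can also sit on the left-hand side of an assignment. The sketch below, on invented data, rewrites one spelling of a value; passing both the mask and the column to a single `loc` call keeps the selection and the assignment in one step.

```python
import pandas as pd

df = pd.DataFrame({'country': ['U.S.', 'China', 'U.S.'],
                   'value': [1, 2, 3]})

# Select the matching rows and the target column in one loc call, then assign
df.loc[df['country'] == 'U.S.', 'country'] = 'USA'

print(list(df['country']))
```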



### Deleting a Column from Your DataFrame


`DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')`

Drop specified labels from rows or columns.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html

Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names. When using a multi-index, labels on different levels can be removed by specifying the level.

In [3]:
df = pd.DataFrame(data=np.array([[1, 2, 3], [40, 50, 9], [7, 2, 9], [40, 50, 9], [23, 35, 37]]), 
                  index= [2.5, 12.6, 4.8, 4.8, 2.5], 
                  columns=['A', 'B', 'C'])
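Using the same frame, a sketch of dropping a column and dropping rows by label. The result names (`no_a`, `no_rows`) are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.array([[1, 2, 3], [40, 50, 9], [7, 2, 9],
                                 [40, 50, 9], [23, 35, 37]]),
                  index=[2.5, 12.6, 4.8, 4.8, 2.5],
                  columns=['A', 'B', 'C'])

no_a = df.drop('A', axis=1)       # axis=1 targets column labels
no_a_kw = df.drop(columns='A')    # equivalent keyword form
no_rows = df.drop(4.8)            # axis=0 default: drops every row labelled 4.8

print(list(no_a.columns), list(no_rows.index))
```

As with `set_index`, `inplace=False` is the default, so `df` keeps all five rows and three columns.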


### Removing a Row from Your DataFrame




`DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)`

Return a DataFrame with duplicate rows removed, optionally considering only certain columns.

<table class="docutils field-list" frame="void" rules="none">
<colgroup><col class="field-name">
<col class="field-body">
</colgroup><tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><p class="first"><strong>subset</strong> : column label or sequence of labels, optional</p>
<blockquote>
<div><p>Only consider certain columns for identifying duplicates, by
default use all of the columns</p>
</div></blockquote>
<p><strong>keep</strong> : {‘first’, ‘last’, False}, default ‘first’</p>
<blockquote>
<div><ul class="simple">
<li><code class="docutils literal notranslate"><span class="pre">first</span></code> : Drop duplicates except for the first occurrence.</li>
<li><code class="docutils literal notranslate"><span class="pre">last</span></code> : Drop duplicates except for the last occurrence.</li>
<li>False : Drop all duplicates.</li>
</ul>
</div></blockquote>
<p><strong>inplace</strong> : boolean, default False</p>
<blockquote>
<div><p>Whether to drop duplicates in place or to return a copy</p>
</div></blockquote>
</td>
</tr>
<tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body"><dl class="first last docutils">
<dt><strong>deduplicated</strong> <span class="classifier-delimiter">:</span> <span class="classifier">DataFrame</span></dt>
<dd></dd>
</dl>
</td>
</tr>
</tbody>
</table>
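A sketch of the `subset` and `keep` parameters on the frame defined earlier in this section. Note that `drop_duplicates` compares column values only; the last line shows one common way to drop repeated index labels instead, using `Index.duplicated`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.array([[1, 2, 3], [40, 50, 9], [7, 2, 9],
                                 [40, 50, 9], [23, 35, 37]]),
                  index=[2.5, 12.6, 4.8, 4.8, 2.5],
                  columns=['A', 'B', 'C'])

dedup = df.drop_duplicates()                          # compare all columns
by_c = df.drop_duplicates(subset='C', keep='last')    # compare column C only
none_kept = df.drop_duplicates(subset='C', keep=False)

# Rows whose index label has already appeared are filtered out here
unique_labels = df[~df.index.duplicated()]

print(len(dedup), len(by_c), len(none_kept), len(unique_labels))
```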

### Combining Datasets: Concat and Append


In [4]:
def make_df(cols, ind):
    """Quickly make a DataFrame"""
    data = {c: [str(c) + str(i) for i in ind]
            for c in cols}
    return pd.DataFrame(data, ind)



In [5]:
# example DataFrame
make_df('ABCD', range(3))

    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2


#### Concatenation of NumPy Arrays



In [6]:
x = [1, 2, 3]
y = [4, 5, 6]
z = [7, 8, 9]
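With the three lists above, NumPy concatenation looks like this. The two-dimensional `grid` array is an added illustration of the `axis` argument.

```python
import numpy as np

x = [1, 2, 3]
y = [4, 5, 6]
z = [7, 8, 9]

# The first argument is a list (or tuple) of arrays to join end to end
joined = np.concatenate([x, y, z])

# For 2D input, axis selects the direction of the join
grid = np.array([[1, 2], [3, 4]])
side_by_side = np.concatenate([grid, grid], axis=1)

print(joined, side_by_side.shape)
```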


#### Simple Concatenation with pd.concat



`pandas.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=None, copy=True)`

[Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html)


In [7]:
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
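A minimal sketch of `pd.concat` applied to the two Series above; the index labels are carried over unchanged.

```python
import pandas as pd

ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])

# Like np.concatenate, pd.concat takes a list of objects to join
combined = pd.concat([ser1, ser2])

print(combined)
```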


In [8]:
df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])



In [56]:
df3 = make_df('AB', [0, 1])
df4 = make_df('CD', [0, 1])
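The same call works for DataFrames. In this sketch `make_df` is repeated so the block runs on its own: `df1` and `df2` are stacked row-wise (the default `axis=0`), while `df3` and `df4` are joined column-wise with `axis=1`.

```python
import pandas as pd

def make_df(cols, ind):
    """Quickly make a DataFrame"""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)

df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
stacked = pd.concat([df1, df2])            # rows of df2 follow rows of df1

df3 = make_df('AB', [0, 1])
df4 = make_df('CD', [0, 1])
side_by_side = pd.concat([df3, df4], axis=1)   # columns of df4 added to df3

print(stacked.shape, list(side_by_side.columns))
```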

#### Duplicate indices



In [61]:
x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
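Unlike `np.concatenate`, `pd.concat` preserves indices even when that produces repeated labels. The sketch below forces the situation by overwriting the index of `y` (an assumption made here to create the duplicates) and then tries the three remedies named in the subsections that follow.

```python
import pandas as pd

def make_df(cols, ind):
    """Quickly make a DataFrame"""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)

x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
y.index = x.index                      # force duplicate index labels

dup = pd.concat([x, y])                # index is now 0, 1, 0, 1

# 1. Catch the repeats as an error
try:
    pd.concat([x, y], verify_integrity=True)
    raised = False
except ValueError:
    raised = True

# 2. Ignore the original index and build a fresh integer one
fresh = pd.concat([x, y], ignore_index=True)

# 3. Label each source with a key, giving a hierarchical (Multi)Index
keyed = pd.concat([x, y], keys=['x', 'y'])

print(list(dup.index), raised, list(fresh.index))
```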

##### Catching the repeats as an error

##### Ignoring the index

##### Adding MultiIndex keys

[More on Hierarchical Indexing](https://jakevdp.github.io/PythonDataScienceHandbook/03.05-hierarchical-indexing.html)

#### Concatenation with joins

In [74]:
df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])
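When the inputs share only some column names, the `join` parameter decides which columns survive. A sketch with the two frames above; `sort=False` is passed explicitly here so the column order does not depend on the pandas version.

```python
import pandas as pd

def make_df(cols, ind):
    """Quickly make a DataFrame"""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)

df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])

# join='outer' (the default): union of columns, gaps filled with NaN
outer = pd.concat([df5, df6], sort=False)

# join='inner': only the columns present in every input
inner = pd.concat([df5, df6], join='inner')

print(list(outer.columns), list(inner.columns))
```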

#### The append() method



### Aggregation and Grouping

#### Planets dataset



#### GroupBy: Split, Apply, Combine

##### Split, apply, combine

<img src="https://jakevdp.github.io/PythonDataScienceHandbook/figures/03.08-split-apply-combine.png" alt="">



In [9]:
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])
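For this frame, the split-apply-combine steps in the figure can be written in one line with `groupby`. The dictionary comprehension is an added illustration that does the same steps by hand, to make the "split" visible.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])

grouped = df.groupby('key')     # split: nothing is computed yet
totals = grouped.sum()          # apply sum to each group, then combine

# The same result by iterating over (key, sub-frame) pairs
manual = {key: part['data'].sum() for key, part in grouped}

# Column indexing on the GroupBy object returns a per-column GroupBy
medians = grouped['data'].median()

print(totals)
```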


#### The GroupBy object

#### Column indexing

#### Dispatch methods



#### Aggregate, filter, transform, apply

In [10]:
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)},
                   columns = ['key', 'data1', 'data2'])
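A sketch of the four operations on this frame. The threshold in `filter_func` and the lambda passed to `apply` are arbitrary choices for illustration, and the data columns are selected explicitly so the behaviour does not depend on how a given pandas version treats the grouping column.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)},
                  columns=['key', 'data1', 'data2'])

# Aggregation: reduce each group to one value per column
agg = df.groupby('key').aggregate({'data1': 'min', 'data2': 'max'})

# Filtering: keep or drop whole groups based on a group-level test
def filter_func(group):
    return group['data2'].std() > 4

filtered = df.groupby('key').filter(filter_func)

# Transformation: same shape as the input, here centred on the group mean
centered = df.groupby('key')[['data1', 'data2']].transform(lambda x: x - x.mean())

# apply(): arbitrary function of each group's sub-frame
spread = df.groupby('key')[['data1', 'data2']].apply(
    lambda g: g['data1'].max() - g['data2'].min())

print(agg)
```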


##### Aggregation


##### Filtering



##### Transformation

##### The apply() method

### Working with Time Series

#### Dates and Times in Python


##### Native Python dates and times



##### Typed arrays of times: NumPy's datetime64

[Documentation](https://docs.scipy.org/doc/numpy/reference/arrays.datetime.html)

<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<table>
<thead><tr>
<th>Code</th>
<th>Meaning</th>
<th>Time span (relative)</th>
<th>Time span (absolute)</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>Y</code></td>
<td>Year</td>
<td>± 9.2e18 years</td>
<td>[9.2e18 BC, 9.2e18 AD]</td>
</tr>
<tr>
<td><code>M</code></td>
<td>Month</td>
<td>± 7.6e17 years</td>
<td>[7.6e17 BC, 7.6e17 AD]</td>
</tr>
<tr>
<td><code>W</code></td>
<td>Week</td>
<td>± 1.7e17 years</td>
<td>[1.7e17 BC, 1.7e17 AD]</td>
</tr>
<tr>
<td><code>D</code></td>
<td>Day</td>
<td>± 2.5e16 years</td>
<td>[2.5e16 BC, 2.5e16 AD]</td>
</tr>
<tr>
<td><code>h</code></td>
<td>Hour</td>
<td>± 1.0e15 years</td>
<td>[1.0e15 BC, 1.0e15 AD]</td>
</tr>
<tr>
<td><code>m</code></td>
<td>Minute</td>
<td>± 1.7e13 years</td>
<td>[1.7e13 BC, 1.7e13 AD]</td>
</tr>
<tr>
<td><code>s</code></td>
<td>Second</td>
<td>± 2.9e11 years</td>
<td>[2.9e11 BC, 2.9e11 AD]</td>
</tr>
<tr>
<td><code>ms</code></td>
<td>Millisecond</td>
<td>± 2.9e8 years</td>
<td>[2.9e8 BC, 2.9e8 AD]</td>
</tr>
<tr>
<td><code>us</code></td>
<td>Microsecond</td>
<td>± 2.9e5 years</td>
<td>[290301 BC, 294241 AD]</td>
</tr>
<tr>
<td><code>ns</code></td>
<td>Nanosecond</td>
<td>± 292 years</td>
<td>[ 1678 AD, 2262 AD]</td>
</tr>
<tr>
<td><code>ps</code></td>
<td>Picosecond</td>
<td>± 106 days</td>
<td>[ 1969 AD, 1970 AD]</td>
</tr>
<tr>
<td><code>fs</code></td>
<td>Femtosecond</td>
<td>± 2.6 hours</td>
<td>[ 1969 AD, 1970 AD]</td>
</tr>
<tr>
<td><code>as</code></td>
<td>Attosecond</td>
<td>± 9.2 seconds</td>
<td>[ 1969 AD, 1970 AD]</td>
</tr>
</tbody>
</table>

</div>
</div>
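A brief sketch of how the unit codes in the table show up in practice. The dates are arbitrary; the unit is either inferred from the string or given explicitly as a second argument.

```python
import numpy as np

# A date stored with day precision
date = np.array('2015-07-04', dtype=np.datetime64)

# Vectorized arithmetic: adding integers steps in units of the dtype (days)
following = date + np.arange(3)

# The unit is inferred from the input string when not given...
minute = np.datetime64('2015-07-04 12:00')

# ...or forced with an explicit unit code from the table
nano = np.datetime64('2015-07-04 12:59:59.50', 'ns')

print(following, minute.dtype, nano.dtype)
```

The trade-off shown in the table applies here: the finer the unit, the narrower the range of dates a 64-bit value can represent.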

#### Dates and times in pandas: best of both worlds


#### Pandas Time Series: Indexing by Time

#### Pandas Time Series Data Structures


In [13]:
from datetime import datetime
dates = pd.to_datetime([datetime(2015, 7, 3), '4th of July, 2015',
                       '2015-Jul-6', '07-07-2015', '20150708'])
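The result of `pd.to_datetime` on a list is a `DatetimeIndex`, which can be converted, subtracted, and used to index a Series. This sketch uses ISO-formatted strings only, since handling of mixed formats varies between pandas versions; the values are made up.

```python
import pandas as pd

dates = pd.to_datetime(['2015-07-03', '2015-07-04', '2015-07-06'])

periods = dates.to_period('D')     # DatetimeIndex -> PeriodIndex
deltas = dates - dates[0]          # subtraction gives a TimedeltaIndex

# A Series indexed by time supports slicing with date strings
ts = pd.Series([0, 1, 2], index=dates)
from_july_4 = ts.loc['2015-07-04':]

print(deltas)
```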



#### Example

### Understanding SettingWithCopyWarning in pandas


#### What is SettingWithCopyWarning?



<img class="full-width" src="https://www.dataquest.io/wp-content/uploads/2019/01/view-vs-copy.png" alt="view-vs-copy">



<img class="full-width" src="https://www.dataquest.io/wp-content/uploads/2019/01/modifying.png" alt="modifying">
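A minimal sketch of the usual way to avoid the warning, on invented data. The chained form is shown only as a comment, because whether it modifies `df` depends on whether the intermediate result is a view or a copy, and on the pandas version.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# Chained assignment (get, then set) may act on a temporary copy and is the
# pattern that typically triggers the warning:
#     df[df['a'] > 1]['b'] = 0

# A single loc call selects and assigns in one operation on df itself
df.loc[df['a'] > 1, 'b'] = 0

# When an independent subset is wanted, request a copy explicitly
subset = df[df['a'] > 1].copy()
subset['b'] = 99

print(list(df['b']), list(subset['b']))
```

Modifying `subset` after the explicit `.copy()` leaves `df` untouched, which makes the intent unambiguous.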



## More about Pandas

[Pandas on PyVideo](https://pyvideo.org/)