<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

## Intro to Pandas 

This notebook adapts the code from [10 Minutes to pandas](https://pandas.pydata.org/pandas-docs/version/0.23.0/10min.html) for use in a Databricks notebook.

Please do not upgrade the version of pandas in Databricks: doing so breaks the Databricks `display()` command.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Object Creation


See the [Data Structure Intro section](https://pandas.pydata.org/pandas-docs/version/0.23.0/dsintro.html#dsintro)

Creating a [Series](https://pandas.pydata.org/pandas-docs/version/0.23.0/generated/pandas.Series.html#pandas.Series) by passing a list of values, letting pandas create a default integer index:

In [5]:
s = pd.Series([1,3,5,np.nan,6,8])
s
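As a small illustrative sketch, the default index and the effect of `np.nan` on the dtype can be inspected directly. The names `index_labels`, `series_dtype`, and `missing_count` are introduced here for illustration and are not part of the original tutorial.

```python
import numpy as np
import pandas as pd

# Same values as above; np.nan marks a missing entry.
s = pd.Series([1, 3, 5, np.nan, 6, 8])

# With no index given, pandas labels the rows 0..n-1.
index_labels = list(s.index)

# np.nan is a float, so the integer inputs are upcast to float64.
series_dtype = str(s.dtype)

# Count the missing entries.
missing_count = int(s.isna().sum())

print(index_labels, series_dtype, missing_count)
```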

Creating a [DataFrame](https://pandas.pydata.org/pandas-docs/version/0.23.0/generated/pandas.DataFrame.html#pandas.DataFrame) by passing a numpy array, with a datetime index and labeled columns:

In [7]:
dates = pd.date_range('20130101', periods=6)
dates

In [8]:
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df
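Because `np.random.randn` draws new values on every run, the numbers shown in this notebook will differ between sessions. One possible way to get repeatable output, sketched here as an assumption rather than something the tutorial does, is to draw from a seeded `np.random.RandomState`:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)

# Seeded generator (an assumption for reproducibility; the tutorial
# itself uses the unseeded np.random.randn).
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.randn(6, 4), index=dates, columns=list('ABCD'))

# Six daily rows, four labeled columns.
print(df.shape)
print(df.index[0], df.index[-1])
```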

Creating a DataFrame by passing a dict of objects that can be converted to Series-like structures:

In [10]:
df2 = pd.DataFrame({ 'A' : 1.,
                     'B' : pd.Timestamp('20130102'),
                     'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                     'D' : np.array([3] * 4,dtype='int32'),
                     'E' : pd.Categorical(["test","train","test","train"]),
                     'F' : 'foo' })

df2

The columns of the resulting DataFrame have different [dtypes](https://pandas.pydata.org/pandas-docs/version/0.23.0/basics.html#basics-dtypes):

In [12]:
df2.dtypes
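The sketch below uses a reduced version of the dict above (only four of the six columns) to show two things: a scalar value is broadcast to the length of the other columns, and each column keeps its own dtype. The names `row_count` and `dtype_names` are illustrative.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'A': 1.,
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"])})

# The scalar in 'A' is repeated to match the four rows of the other columns.
row_count = len(df2)

# Map each column name to the string name of its dtype.
dtype_names = {col: str(dtype) for col, dtype in df2.dtypes.items()}

print(row_count, dtype_names)
```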

## Viewing Data

See the [Basics section](https://pandas.pydata.org/pandas-docs/version/0.23.0/basics.html#basics)

View the top and bottom rows of the frame:

In [14]:
df.head()

In [15]:
df.tail(3)

Display the index, the columns, and the underlying NumPy data:

In [17]:
df.index

In [18]:
df.columns

In [19]:
df.values

`describe()` shows a quick statistical summary of your data:

In [21]:
df.describe()
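To see which statistics `describe()` reports, here is a minimal sketch on a hypothetical one-column frame (`small` is not part of the tutorial) whose values are easy to check by hand:

```python
import pandas as pd

# Hypothetical data chosen so the statistics are easy to verify.
small = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0]})
summary = small.describe()

# The summary is itself a DataFrame indexed by statistic name.
summary_rows = list(summary.index)

print(summary_rows)
print(summary.loc['mean', 'x'], summary.loc['50%', 'x'])
```

Since the result is an ordinary DataFrame, individual statistics can be read back with `.loc`, as in the last line.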

Transposing your data

In [23]:
df.T

Sorting by an axis

In [25]:
df.sort_index(axis=1, ascending=False)

Sorting by values

In [27]:
df.sort_values(by='B')
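With random data the effect of sorting can be hard to see. The following sketch uses a small hypothetical frame to contrast the two calls: `sort_index(axis=1)` reorders the column labels, while `sort_values` reorders rows by the values in a column.

```python
import pandas as pd

# Hypothetical frame, small enough to check the ordering by eye.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [30, 10, 20]},
                  index=['x', 'y', 'z'])

# axis=1 sorts the column labels; ascending=False reverses them.
by_column_label = df.sort_index(axis=1, ascending=False)

# sort_values reorders the rows by the values in column 'B'.
by_b = df.sort_values(by='B')

# Both calls return new frames; df itself is unchanged.
print(list(by_column_label.columns), list(by_b.index), list(df.index))
```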

## Selection

**Note:** While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code we recommend the optimized pandas data access methods `.at`, `.iat`, `.loc` and `.iloc`. The older `.ix` indexer has been deprecated since pandas 0.20 and should be avoided.


See the indexing documentation [Indexing and Selecting Data](https://pandas.pydata.org/pandas-docs/version/0.23.0/indexing.html#indexing) and [MultiIndex / Advanced Indexing](https://pandas.pydata.org/pandas-docs/version/0.23.0/advanced.html#advanced)

Selecting a single column, which yields a Series, equivalent to df.A

In [30]:
df['A']

Selecting via [], which slices the rows.

In [32]:
df[0:3]

In [33]:
df['20130102':'20130104']

## Selection by Label

See more in [Selection by Label](https://pandas.pydata.org/pandas-docs/version/0.23.0/indexing.html#indexing-label)

For getting a cross section using a label

In [35]:
df.loc[dates[0]]

In [36]:
df.loc[:,['A','B']]

Showing label slicing, both endpoints are included

In [38]:
df.loc['20130102':'20130104',['A','B']]
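The inclusive endpoints are the main difference from positional slicing. A minimal sketch, using a deterministic frame built with `np.arange` instead of random data (an assumption made so the result is easy to verify):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

# Label-based slice: the rows for 2013-01-02, -03 and -04 are all returned.
by_label = df.loc['20130102':'20130104', ['A', 'B']]

# Position-based slice: positions 1 and 2 only; the stop position 3 is excluded.
by_position = df.iloc[1:3, 0:2]

print(len(by_label), len(by_position))
```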

Reduction in the dimensions of the returned object

In [40]:
df.loc['20130102',['A','B']]
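Whether the result is reduced depends on how the row is requested. In this sketch (deterministic data and the names `row_as_series` and `row_as_frame` are illustrative), a single label returns a Series, while a one-element list of labels keeps the DataFrame shape:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

# A single row label reduces the result to a Series indexed by column name.
row_as_series = df.loc[dates[1], ['A', 'B']]

# Wrapping the label in a list keeps a one-row DataFrame.
row_as_frame = df.loc[[dates[1]], ['A', 'B']]

print(type(row_as_series).__name__, type(row_as_frame).__name__)
```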

For getting a scalar value

In [42]:
df.loc[dates[0],'A']

For getting fast access to a scalar (equivalent to the prior method)

In [44]:
df.at[dates[0],'A']

## Selection by Position

See more in [Selection by Position](https://pandas.pydata.org/pandas-docs/version/0.23.0/indexing.html#indexing-integer)

Select via the position of the passed integers

In [46]:
df.iloc[3]

By integer slices, acting similarly to NumPy/Python

In [48]:
df.iloc[3:5,0:2]

By lists of integer position locations, similar to the NumPy/Python style

In [50]:
df.iloc[[1,2,4],[0,2]]

For slicing rows explicitly

In [52]:
df.iloc[1:3,:]

For slicing columns explicitly

In [54]:
df.iloc[:,1:3]

For getting a value explicitly

In [56]:
df.iloc[1,1]

For getting fast access to a scalar (equivalent to the prior method)

In [58]:
df.iat[1,1]
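Label-based and position-based scalar access refer to the same cells. As a sketch on deterministic data, `Index.get_loc` can translate a label into its integer position, so `.iat` and `.at` can be compared directly:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

# Translate a row label and a column label into integer positions.
row_pos = df.index.get_loc(dates[1])
col_pos = df.columns.get_loc('B')

# The same cell, read by position and by label.
by_position = df.iat[row_pos, col_pos]
by_label = df.at[dates[1], 'B']

print(row_pos, col_pos, by_position, by_label)
```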

## Boolean Indexing

Using a single column’s values to select data.

In [60]:
df[df.A > 0]

Selecting values from a DataFrame where a boolean condition is met.

In [62]:
df[df > 0]
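The two forms of boolean indexing behave differently: a Series mask drops whole rows, while a DataFrame mask keeps the shape and fills non-matching cells with `NaN`. A minimal sketch on a hypothetical frame:

```python
import pandas as pd

# Hypothetical values with a mix of signs.
df = pd.DataFrame({'A': [-1.0, 2.0, -3.0],
                   'B': [4.0, -5.0, 6.0]})

# Series mask: keeps only the rows where column A is positive.
rows_where_a_positive = df[df['A'] > 0]

# DataFrame mask: same shape as df, NaN wherever the condition is False.
masked = df[df > 0]
nan_count = int(masked.isna().sum().sum())

print(rows_where_a_positive.shape, masked.shape, nan_count)
```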

Using the [isin()](https://pandas.pydata.org/pandas-docs/version/0.23.0/generated/pandas.Series.isin.html#pandas.Series.isin) method for filtering:

In [64]:
df2 = df.copy()
df2['E'] = ['one', 'one','two','three','four','three']
df2

In [65]:
df2[df2['E'].isin(['two','four'])]
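`isin()` returns a boolean Series that can be stored, reused, or negated with `~`. The sketch below applies it to a standalone Series holding the same labels as column `E`; the names `mask`, `selected`, and `excluded` are illustrative.

```python
import pandas as pd

labels = pd.Series(['one', 'one', 'two', 'three', 'four', 'three'])

# True where the label is one of the listed values.
mask = labels.isin(['two', 'four'])

selected = labels[mask]    # entries that match
excluded = labels[~mask]   # ~ inverts the mask

print(selected.tolist(), excluded.tolist())
```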

## Setting

Setting a new column automatically aligns the data by the indexes

In [67]:
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
s1

In [68]:
df['F'] = s1
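Because `s1` starts one day later than `df`, alignment leaves a gap at the first row and discards the value whose date is not in `df`. A sketch on a simplified one-column frame (an assumption made to keep the output small):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame({'A': np.zeros(6)}, index=dates)

# s1 covers 2013-01-02 .. 2013-01-07, so it only partly overlaps df's index.
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
df['F'] = s1

# 2013-01-01 has no match in s1 and becomes NaN; the value for
# 2013-01-07 has no matching row in df and is dropped.
f_is_missing = df['F'].isna().tolist()
f_present = df['F'].dropna().tolist()

print(f_is_missing, f_present)
```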

Setting values by label

In [70]:
df.at[dates[0],'A'] = 0

Setting values by position

In [72]:
df.iat[0,1] = 0

Setting by assigning with a numpy array

In [74]:
df.loc[:,'D'] = np.array([5] * len(df))

The result of the prior setting operations

In [76]:
df

A where operation with setting.

In [78]:
df2 = df.copy()
df2[df2 > 0] = -df2
df2
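The assignment above replaces every positive cell with its negation and leaves the other cells alone, so no positive values remain. A minimal sketch on a hypothetical frame with known values:

```python
import pandas as pd

# Hypothetical values, including a zero and some negatives.
df = pd.DataFrame({'A': [1.0, -2.0, 3.0],
                   'B': [-4.0, 5.0, 0.0]})

df2 = df.copy()

# Only cells where the mask is True are overwritten, each with its own negation.
df2[df2 > 0] = -df2

print(df2)
```

Working on a copy leaves the original `df` untouched, which is why the tutorial calls `df.copy()` first.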

We will continue in the [next notebook]($./00c Pandas Tutorial II)

&copy; 2018 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>