# Pandas

Pandas is an open-source, BSD-licensed Python library providing **high-performance, 
easy-to-use data structures and data analysis tools** for the Python programming language. 
Python with Pandas is **used in a wide range of academic and commercial fields, including finance, 
economics, statistics, and analytics**. In this tutorial, we will learn the various features of Python Pandas 
and how to use them in practice.


### Pandas = NumPy + Algorithms


### Key Features of Pandas:
* **Fast and efficient** DataFrame object with default and customized indexing.
* Tools for **loading data into in-memory** data objects from different file formats.
* Data alignment and integrated **handling of missing data**.
* **Reshaping and pivoting** of data sets.
* **Label-based slicing, indexing and subsetting** of large data sets.
* Columns from a data structure **can be deleted or inserted**.
* **Group by data** for aggregation and transformations.
* High performance **merging and joining of data.**
* **Time Series** functionality.

### Installation
* pip install pandas
* conda install pandas
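
After either command finishes, a quick way to confirm the installation is to import the library and print its version. This is only a sanity check; the variable name `version` is chosen here for illustration.

```python
import pandas as pd

# Confirm that pandas imports correctly and report the installed version
version = pd.__version__
print("pandas version:", version)
```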

### Topics
* Introduction to Pandas Data Structures
* Descriptive Analysis
* Pandas Input-Output
* Pandas Manipulation
* Pandas Groupby
* Importing Libraries - Creating Data Sets - Creating Data Frames - Reading From CSV - Exporting To CSV - Finding Maximums - Plotting Data
* Reading From TXT - Exporting To TXT - Selecting Top/Bottom Records - Descriptive Statistics - Grouping/Sorting Data

# Introduction to Pandas Data Structures

### Pandas deals with the following three data structures −
* Series
* DataFrame
* Panel

These data structures are built on top of NumPy arrays, which is what makes them fast.


#### Series
Series is a one-dimensional array like structure with homogeneous data.

10	23	56	17	52	61	73	90	26	72

* Key Points
    * Homogeneous data
    * Size Immutable (operations that combine Series, such as pd.concat, return a new Series instead of resizing the original)
    * Values of Data Mutable (i.e. s[0] = 10 will work )
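
As a minimal sketch of the points above, the row of numbers shown earlier can be stored in a Series and one value overwritten in place. The replacement value 99 is arbitrary.

```python
import pandas as pd

# A Series holds one-dimensional data that shares a single dtype
s = pd.Series([10, 23, 56, 17, 52, 61, 73, 90, 26, 72])

# Values are mutable: assign to an existing label (0 is the first default label)
s[0] = 99

print(s.dtype)      # the dtype shared by every element
print(s.head(3))    # first three values after the change
```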


#### DataFrame
DataFrame is a two-dimensional array with heterogeneous data. For example,

Name	Age	Gender	Rating<br>
Steve	32	Male	3.45<br>
Lia	    28	Female	4.6<br>
Vin	    45	Male	3.9<br>
Katie	38	Female	2.78<br>

* Key Points
    * Heterogeneous data
    * Size Mutable
    * Data Mutable
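
The example table can be rebuilt as a DataFrame to see these three properties in practice. This is a sketch only: the name `df_people` and the extra column `AgeNextYear` are made up for illustration.

```python
import pandas as pd

# Rebuild the example table; each column keeps its own dtype
df_people = pd.DataFrame({
    "Name": ["Steve", "Lia", "Vin", "Katie"],
    "Age": [32, 28, 45, 38],
    "Gender": ["Male", "Female", "Male", "Female"],
    "Rating": [3.45, 4.6, 3.9, 2.78],
})

# Size mutable: add a new column (hypothetical, derived from Age)
df_people["AgeNextYear"] = df_people["Age"] + 1

# Data mutable: overwrite a single cell by row label and column name
df_people.loc[0, "Rating"] = 3.5

print(df_people.dtypes)
print(df_people)
```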


#### Panel
Panel is a three-dimensional data structure with heterogeneous data. A panel is hard to draw, but it can be pictured as a container of DataFrames.

* Key Points
    * Heterogeneous data
    * Size Mutable
    * Data Mutable

#### Panel (3D)  >  DF (2D)  >  Series (1D)

#### Note − DataFrame is widely used and one of the most important data structures. Panel is used much less; it was deprecated in pandas 0.20 and removed in pandas 0.25.


# Read & Write different files using Pandas

### 1. read_csv(.csv)

In [4]:
import pandas as pd
draft1 = pd.read_csv('test.csv')    # Supply the file name (path)

# draft1.head(6)                          # Check the first 6 rows
# draft1.head()                           # Default is the first 5 rows

draft1.iloc[1:4, 0:3]                     # Rows at positions 1-3, first 3 columns

Unnamed: 0,x,y,z
1,1,25,30
2,2,45,50
3,3,65,55


### 2. read_table(.tsv)

In [6]:
# TSV: tab-delimited file (a delimited text format like CSV, with tabs instead of commas)

# Excel > CSV > TSV

'''
TSV is a file extension for a tab-delimited file used with spreadsheet software. 
TSV stands for Tab Separated Values. TSV files are used for raw data and can be imported into and exported from 
spreadsheet software. 

Difference between CSV and TSV:
In some enterprise environments TSV is used in such bulk that it is very difficult
to judge which format is more common. For instance, the default text format used in Hadoop is tab-delimited files, 
and using CSV there is much more complicated. 
Given the petabytes of data that have been stored in this format, 
it is likely that there is more data in the world stored in TSV, 
but that there are more CSV files in existence.
'''
draft2 = pd.read_csv('test.tsv',sep="\t")
# draft2 = pd.read_table('test.tsv')  # Read a tsv into a DataFrame
draft2.head(6)                           # Check the first 6 rows

Unnamed: 0,Name,Age,Address
0,Paul,23,1115 W Franklin
1,Bessy the Cow,5,Big Farm Way
2,Zeke,45,W Main St


### 3. read_excel(.xlsx)

In [407]:
draft3 = pd.read_excel('test.xlsx')       # Path to Excel file
# draft3 = pd.read_excel('test.xlsx', sheet_name = 'sheet1') # Name of sheet to read from

draft3.head(6)                            # Check the first 6 rows

Unnamed: 0,Category,Function,Description,New?,Help Topic,Help
0,Logical,IFERROR,Returns a different result if the first argume...,True,HA01231765,<More Info>
1,Statistical,AVERAGEIF,Returns the average for the cells specified by...,True,HA10047433,<More Info>
2,Statistical,AVERAGEIFS,Returns the average for the cells specified by...,True,HA10047493,<More Info>
3,Statistical,COUNTIFS,Counts the number of cells that meet multiple ...,True,HA10047494,<More Info>
4,Math & Trig,SUMIFS,Adds the cells specified by a multiple criteria,True,HA10047504,<More Info>
5,Cube,CUBEMEMBER,Returns a member or tuple in a cube hierarchy,True,HA10083017,<More Info>


### 4. read_html(.html from url)

In [17]:
import pandas as pd
# url = "http://www.basketball-reference.com/leagues/NBA_2015_totals.html"
# url = "https://www.w3schools.com/bootstrap/bootstrap_tables.asp"
url="https://en.wikipedia.org/wiki/Indian_Institutes_of_Information_Technology"

# read_html extracts every table on the page and returns a list of DataFrames
BB_data = pd.read_html(url)         # Read data from the specified url
# print(BB_data)

# print(BB_data[0].head())          # Check the first 5 rows of the first table
BB_data[0].iloc[:, :]               # Show the whole first table

Unnamed: 0,0,1,2,3,4
0,Name,Short name,Established,Mode,State/UT
1,Atal Bihari Vajpayee Indian Institute of Infor...,ABV-IIITM Gwalior,1997,MHRD,Madhya Pradesh
2,"Indian Institute of Information Technology, Al...",IIITA,1999,MHRD,Uttar Pradesh
3,"Indian Institute of Information Technology, De...",IIITDMJ,2005,MHRD,Madhya Pradesh
4,"Indian Institute of Information Technology, De...",IIITDM Kancheepuram,2007,MHRD,Tamil Nadu
5,"Indian Institute of Information Technology, Sr...",IIITS,2013,PPP,Andhra Pradesh
6,"Indian Institute of Information Technology, Gu...",IIITG,2013,PPP,Assam
7,"Indian Institute of Information Technology, Va...",IIITV,2013,PPP,Gujarat
8,"Indian Institute of Information Technology, Kota",IIITK,2013,PPP,Rajasthan
9,Indian Institute of Information Technology Tir...,IIITT,2013,PPP,Tamil Nadu


### 5. read_clipboard

In [20]:
BB_reference_data = pd.read_clipboard(sep="\t")  # Read data from the clipboard

BB_reference_data.iloc[:, :]   # Show all rows and columns
# BB_reference_data.iloc[:5, 0:10]   # First 5 rows and 10 columns (.ix has been removed from pandas)



Unnamed: 0,0,Paul,23,1115 W Franklin
0,1,Bessy the Cow,5,Big Farm Way
1,2,Zeke,45,W Main St


# Save data 

In [23]:
BB_reference_data.to_csv("save_data1.csv") 

import os
print(os.listdir(os.getcwd()))
data=pd.read_csv("save_data1.csv")
print(100*"-")
data.head()

['.DS_Store', '.ipynb_checkpoints', 'save_data.csv', 'save_data1.csv', 'session9_Pandas+I.ipynb', 'test.csv', 'test.tsv']
----------------------------------------------------------------------------------------------------


Unnamed: 0.1,Unnamed: 0,0,Paul,23,1115 W Franklin
0,0,1,Bessy the Cow,5,Big Farm Way
1,1,2,Zeke,45,W Main St


# Playing with titanic_train Dataset

In [25]:
import numpy as np
import pandas as pd

# import os
# import matplotlib.pyplot as plt
# %matplotlib inline

In [26]:
url = "https://gist.githubusercontent.com/michhar/2dfd2de0d4f8727f873422c5d959fff5/raw/ff414a1bcfcba32481e4d4e8db578e55872a2ca1/titanic.csv"
titanic_train = pd.read_csv(url, sep='\t')
titanic_train.head()
# reading from file
# titanic_train = pd.read_csv("titanic.csv", sep='\t') 

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [40]:
titanic_train.iloc[:,:]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [38]:
titanic_train.shape 

(156, 12)

In [427]:
titanic_train.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

### df.describe

Generate various summary statistics.

- The output DataFrame index depends on the requested dtypes:
    - For numeric dtypes, it will include: count, mean, std, min, max, and the lower, 50th, and upper percentiles (25%, 50%, and 75% by default).
    - For object dtypes (e.g. strings), the index will include count, unique, top (the most common value), and freq (the frequency of the most common value). In older pandas versions, timestamps also include the first and last items.
    - For mixed dtypes, the index will be the union of the corresponding output types. Non-applicable entries will be filled with NaN. Note that mixed-dtype outputs can only be returned from mixed-dtype inputs and appropriate use of the include/exclude arguments.

If multiple values have the highest count, then the count and most common pair will be arbitrarily chosen from among those with the highest count.

The include, exclude arguments are ignored for Series.
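
The two kinds of summary can be seen on a tiny made-up frame before applying them to the Titanic data. The column names and values below are illustrative only.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0],
    "sex": ["male", "female", "female", "female"],
})

# Numeric columns are summarized by default; missing values are not counted
numeric_summary = df_demo.describe()

# Selecting only the string column gives count, unique, top and freq
object_summary = df_demo[["sex"]].describe()

print(numeric_summary)
print(object_summary)
```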

In [50]:
# For numeric dtypes

titanic_train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,156.0,156.0,156.0,126.0,156.0,156.0,156.0
mean,78.5,0.346154,2.423077,28.141508,0.615385,0.397436,28.109587
std,45.177428,0.477275,0.795459,14.61388,1.056235,0.870146,39.401047
min,1.0,0.0,1.0,0.83,0.0,0.0,6.75
25%,39.75,0.0,2.0,19.0,0.0,0.0,8.00315
50%,78.5,0.0,3.0,26.0,0.0,0.0,14.4542
75%,117.25,1.0,3.0,35.0,1.0,0.0,30.37185
max,156.0,1.0,3.0,71.0,5.0,5.0,263.0


In [46]:
# For categorical (object) dtypes

categorical = titanic_train.dtypes[titanic_train.dtypes == "object"].index
print(categorical)
titanic_train[categorical].describe()

Index(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], dtype='object')


Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,156,156,156,31,155
unique,156,2,145,28,3
top,"Mionoff, Mr. Stoytcho",male,349909,C123,S
freq,1,100,2,2,110


In [433]:
sorted(titanic_train["Name"])[0:15]  # Check the first 15 sorted names

['Ahlin, Mrs. Johan (Johanna Persdotter Larsson)',
 'Allen, Mr. William Henry',
 'Andersson, Miss. Ellis Anna Maria',
 'Andersson, Miss. Erna Alexandra',
 'Andersson, Mr. Anders Johan',
 'Andersson, Mr. August Edvard ("Wennerstrom")',
 'Andreasson, Mr. Paul Edvin',
 'Andrew, Mr. Edgardo Samuel',
 'Arnold-Franchi, Mrs. Josef (Josefine Franchi)',
 'Asplund, Mrs. Carl Oscar (Selma Augusta Emilia Johansson)',
 'Attalah, Miss. Malake',
 'Backstrom, Mrs. Karl Alfred (Maria Mathilda Gustafsson)',
 'Barton, Mr. David John',
 'Bateman, Rev. Robert James',
 'Baxter, Mr. Quigg Edmond']

In [441]:
titanic_train["Name"].describe()

count                       156
unique                      156
top       Pekoniemi, Mr. Edvard
freq                          1
Name: Name, dtype: object

In [54]:
titanic_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [29]:
# Convert Survived into a categorical with 2 categories
# print(titanic_train["Survived"])
new_survived1 = pd.Categorical(titanic_train["Survived"])
print(new_survived1)
print(100*"-")
print(new_survived1.describe())

new_survived2 = new_survived1.rename_categories(["Died","Survived"])              

print(new_survived2)
print(100*"-")
new_survived2.describe()

[0, 1, 1, 1, 0, ..., 1, 0, 0, 0, 0]
Length: 156
Categories (2, int64): [0, 1]
----------------------------------------------------------------------------------------------------
            counts     freqs
categories                  
0              102  0.653846
1               54  0.346154
[Died, Survived, Survived, Survived, Died, ..., Survived, Died, Died, Died, Died]
Length: 156
Categories (2, object): [Died, Survived]
----------------------------------------------------------------------------------------------------


Unnamed: 0_level_0,counts,freqs
categories,Unnamed: 1_level_1,Unnamed: 2_level_1
Died,102,0.653846
Survived,54,0.346154


In [62]:
new_Pclass1 = pd.Categorical(titanic_train["Pclass"], ordered=True)
print(new_Pclass1)


new_Pclass2 = new_Pclass1.rename_categories(["Class1","Class2","Class3"])     
print(new_Pclass2)
new_Pclass2.describe()

[3, 1, 3, 1, 3, ..., 1, 3, 3, 3, 1]
Length: 156
Categories (3, int64): [1 < 2 < 3]
[Class3, Class1, Class3, Class1, Class3, ..., Class1, Class3, Class3, Class3, Class1]
Length: 156
Categories (3, object): [Class1 < Class2 < Class3]


Unnamed: 0_level_0,counts,freqs
categories,Unnamed: 1_level_1,Unnamed: 2_level_1
Class1,30,0.192308
Class2,30,0.192308
Class3,96,0.615385
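
Passing `ordered=True` matters because it lets pandas compare categories. A short sketch on made-up class values (not the Titanic column itself):

```python
import pandas as pd

# Ordered categorical: 1 < 2 < 3
pclass = pd.Categorical([3, 1, 3, 2], categories=[1, 2, 3], ordered=True)

# Renaming keeps the order, so Class1 < Class2 < Class3
pclass_named = pclass.rename_categories(["Class1", "Class2", "Class3"])

# min() and max() are only defined because the categories are ordered
print(pclass_named.min(), pclass_named.max())
```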


In [63]:
titanic_train["Pclass"] = new_Pclass2
titanic_train["Survived"]=new_survived2
titanic_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,Died,Class3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,Survived,Class1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,Survived,Class3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,Survived,Class1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,Died,Class3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


# Some df functions

In [449]:
# del keyword

# Remove PassengerId (will be removed from df not from the actual data source)
del titanic_train["PassengerId"]    
titanic_train.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,Class3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,1,Class1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,1,Class3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,1,Class1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,0,Class3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [31]:
# reloading data
titanic_train = pd.read_csv(url, sep='\t')
titanic_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [32]:
# unique()

titanic_train["Cabin"].unique()   # Check unique cabins

array([nan, 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
       'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
       'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
       'F E69', 'D47', 'B86', 'F2', 'C2'], dtype=object)

In [38]:
# Convert data to str: astype(str)

import numpy as np
print(titanic_train.head())
print(100*"-")

# Convert every cabin to a string (missing values become the string 'nan')
char_cabin = titanic_train["Cabin"].astype(str) 
print(char_cabin.head())
# print(char_cabin.unique(),"\n\n")
# print(char_cabin.head(),"\n\n")

# Take the first letter of each cabin & create np array; missing cabins become 'n'
new_Cabin = np.array([cabin[0] for cabin in char_cabin]) 
print(new_Cabin,"\n\n")

# create Categorical object
new_Cabin = pd.Categorical(new_Cabin)
print(new_Cabin,"\n\n")
new_Cabin.describe()

# titanic_train["Cabin"] = new_Cabin
# titanic_train.head()



   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  
--

Unnamed: 0_level_0,counts,freqs
categories,Unnamed: 1_level_1,Unnamed: 2_level_1
A,2,0.012821
B,5,0.032051
C,10,0.064103
D,6,0.038462
E,3,0.019231
F,4,0.025641
G,1,0.00641
n,125,0.801282


In [456]:
titanic_train["Cabin"] = new_Cabin
titanic_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,n,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,n,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,n,S
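
The category `n` above is the first letter of the string `'nan'`, so it stands for a missing cabin. One alternative, not used in the notebook, is to take the first letter with the `.str` accessor and give missing cabins an explicit label. The label `"Unknown"` and the sample values below are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Made-up cabin values, including missing ones
cabins = pd.Series([np.nan, "C85", "C123", np.nan, "E46"])

# .str[0] keeps missing values as NaN; fillna then labels them explicitly
deck = cabins.str[0].fillna("Unknown")

deck_cat = pd.Categorical(deck)
print(deck_cat)
```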


# Series

Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). The axis labels are collectively called the index.

#### pandas.Series

pandas.Series( data, index, dtype, copy)
* data: data takes various forms like ndarray, list, constants
* index: Index values must be hashable and the same length as data; they do not have to be unique. Defaults to np.arange(n) if no index is passed.
* dtype: Data type of the Series. If None, the data type is inferred.
* copy: Whether to copy the input data. Default False

In [68]:
import numpy as np
import pandas as pd

labels = ['a','b','c']
my_data = [10,20,30]
arr = np.array(my_data)
d = {'a':10,'b':20,'c':30}

print ("Labels:", labels)
print("My data:", my_data)
print("Dictionary:", d)
print(arr)


Labels: ['a', 'b', 'c']
My data: [10, 20, 30]
Dictionary: {'a': 10, 'b': 20, 'c': 30}
[10 20 30]


In [459]:
pd.Series(data=my_data, index=labels)

a    10
b    20
c    30
dtype: int64

In [461]:
print ("\nHolding numerical data\n",'-'*25, sep='')
print(pd.Series(arr))



Holding numerical data
-------------------------
0    10
1    20
2    30
dtype: int64


In [462]:
print ("\nHolding text labels\n",'-'*20, sep='')
print(pd.Series(labels))



Holding text labels
--------------------
0    a
1    b
2    c
dtype: object


In [156]:
print ("\nHolding functions\n",'-'*20, sep='')
print(pd.Series(data=[sum,print,len]))



Holding functions
--------------------
0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object


In [71]:
print ("\nHolding objects from a dictionary\n",'-'*40, sep='')
s = pd.Series(data=[d.keys, d.items, d.values])
print(pd.Series(data=[d.keys, d.items, d.values]))
s[2]


Holding objects from a dictionary
----------------------------------------
0    <built-in method keys of dict object at 0x0000...
1    <built-in method items of dict object at 0x000...
2    <built-in method values of dict object at 0x00...
dtype: object


<function dict.values>

### Some operations on Series

In [40]:
# Access Series elements

ser1 = pd.Series([1,2,3,4],['CA', 'OR', 'CO', 'AZ'])

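# access by label
print(ser1['CO'])

# access by position (use .iloc; plain ser1[2] on a labeled Series is deprecated)
print(ser1.iloc[2],"\n\n")
# access the full range
print(ser1[::])
# access the range in reverse order
print(ser1[::-1])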

3
3 


CA    1
OR    2
CO    3
AZ    4
dtype: int64
AZ    4
CO    3
OR    2
CA    1
dtype: int64


In [482]:
# Arithmetic operations on Series
# pandas aligns the two Series on their index labels, operates where a label is in both, and puts NaN where a label is missing from either one

ser1 = pd.Series([1,2,3,4],['CA', 'OR', 'CO', 'AZ'])
ser2 = pd.Series([1,2,5,4],['CA', 'OR', 'NV', 'AZ'])

print(ser1+ser2,"\n\n")
print(ser1*ser2,"\n\n")
print(ser1/ser2,"\n\n")

# combination of mathematical operations!
print(np.exp(ser1)+np.log10(ser2))

AZ    8.0
CA    2.0
CO    NaN
NV    NaN
OR    4.0
dtype: float64 


AZ    16.0
CA     1.0
CO     NaN
NV     NaN
OR     4.0
dtype: float64 


AZ    1.0
CA    1.0
CO    NaN
NV    NaN
OR    1.0
dtype: float64 


AZ    55.200210
CA     2.718282
CO          NaN
NV          NaN
OR     7.690086
dtype: float64
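
When the NaN results from unmatched labels are not wanted, the method form of the operator accepts a `fill_value`. A minimal sketch with the same two Series, assuming that treating a missing label as 0 is acceptable for the data at hand:

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], ['CA', 'OR', 'CO', 'AZ'])
ser2 = pd.Series([1, 2, 5, 4], ['CA', 'OR', 'NV', 'AZ'])

# Treat a label missing from one Series as 0 instead of producing NaN
total = ser1.add(ser2, fill_value=0)
print(total)
```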


In [76]:
titanic_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,Died,Class3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,Survived,Class1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,Survived,Class3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,Survived,Class1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,Died,Class3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [478]:
# Max: find the positions of the rows with the highest fare

index = np.where(titanic_train["Fare"] == titanic_train["Fare"].max())
print(index)
titanic_train.iloc[index]

(array([27, 88]),)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
27,28,0,1,"Fortune, Mr. Charles Alexander",male,19.0,3,2,19950,263.0,C,S
88,89,1,1,"Fortune, Miss. Mabel Helen",female,23.0,3,2,19950,263.0,C,S
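
The same rows can also be selected without `np.where`, by indexing the frame directly with the boolean condition. A sketch on a made-up frame (the names `p1` to `p4` are placeholders):

```python
import pandas as pd

fares = pd.DataFrame({
    "Name": ["p1", "p2", "p3", "p4"],
    "Fare": [7.25, 263.0, 8.05, 263.0],
})

# Boolean mask: True for every row whose fare equals the maximum
top = fares[fares["Fare"] == fares["Fare"].max()]
print(top)
```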


In [41]:
# Combine 2 columns to create a new column
# the new column will be added at the end
titanic_train=pd.read_csv(url,sep="\t")
titanic_train["Family"] = titanic_train["SibSp"] + titanic_train["Parch"]
print(titanic_train.head())
most_family = np.where(titanic_train["Family"] == max(titanic_train["Family"]))


del titanic_train["SibSp"]
titanic_train.iloc[most_family]

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  Family  
0      0         A/5 21171   7.2500   NaN        S       1  
1      0          PC 17599  71.2833   C85        C       1  
2      0  STON/O2. 3101282   7.9250   NaN        S       0  
3      0            113803  53.1000  C123        S       1  
4      0       

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Parch,Ticket,Fare,Cabin,Embarked,Family
59,60,0,3,"Goodwin, Master. William Frederick",male,11.0,2,CA 2144,46.9,,S,7
71,72,0,3,"Goodwin, Miss. Lillian Amy",female,16.0,2,CA 2144,46.9,,S,7


### check for null values

In [92]:
dummy_vector = pd.Series([1,None,3,None,7,8])
print(dummy_vector)
print(dummy_vector.isnull())
dummy_vector[dummy_vector.isnull()]

0    1.0
1    NaN
2    3.0
3    NaN
4    7.0
5    8.0
dtype: float64
0    False
1     True
2    False
3     True
4    False
5    False
dtype: bool


1   NaN
3   NaN
dtype: float64

In [98]:
print(titanic_train["Age"][:20],"\n\n")

missing = np.where(titanic_train["Age"].isnull())


print(len(missing[0]))
print(missing)

0     22.0
1     38.0
2     26.0
3     35.0
4     35.0
5      NaN
6     54.0
7      2.0
8     27.0
9     14.0
10     4.0
11    58.0
12    20.0
13    39.0
14    14.0
15    55.0
16     2.0
17     NaN
18    31.0
19     NaN
Name: Age, dtype: float64 


30
(array([  5,  17,  19,  26,  28,  29,  31,  32,  36,  42,  45,  46,  47,
        48,  55,  64,  65,  76,  77,  82,  87,  95, 101, 107, 109, 121,
       126, 128, 140, 154], dtype=int64),)
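
Counting and locating missing values can also be done with pandas methods directly. The sketch below uses made-up ages; filling with the median is only one possible strategy and is not taken from the notebook.

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan])

# Count of missing values
n_missing = ages.isna().sum()

# Integer positions of the missing values
missing_positions = np.flatnonzero(ages.isna().to_numpy())

# One possible fill strategy (an assumption): replace NaN with the median
filled = ages.fillna(ages.median())

print(n_missing, missing_positions)
print(filled)
```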


# Data Frames

A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.

### Features of DataFrame
* Columns can be of different types
* Size – Mutable
* Labeled axes (rows and columns)
* Can perform arithmetic operations on rows and columns

### pandas.DataFrame
pandas.DataFrame( data, index, columns, dtype, copy)

* Parameter & Description
    * data: data takes various forms like ndarray, Series, map, lists, dict, constants and also another DataFrame.
    * index: Row labels for the resulting frame. Optional; defaults to np.arange(n) if no index is passed.
    * columns: Column labels for the resulting frame. Optional; defaults to np.arange(n) if no column labels are passed.
    * dtype: A single data type to apply to all columns. If None, it is inferred for each column.
    * copy: Whether to copy the input data. Older pandas versions default to False; newer versions choose the default based on the input type.
    
### Create DataFrame
A pandas DataFrame can be created using various inputs like
* Lists
* dict
* Series
* Numpy ndarrays
* Another DataFrame
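
Each input type in the list above can be sketched with a few lines. The column names and values here are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a list of rows, with column labels supplied separately
from_lists = pd.DataFrame([[1, "a"], [2, "b"]], columns=["num", "letter"])

# From a dict: keys become column labels
from_dict = pd.DataFrame({"num": [1, 2], "letter": ["a", "b"]})

# From a dict of Series: the Series are aligned on their index
from_series = pd.DataFrame({"num": pd.Series([1, 2]), "letter": pd.Series(["a", "b"])})

# From a 2-D NumPy array
from_ndarray = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=["x", "y"])

# From another DataFrame
from_frame = pd.DataFrame(from_dict)

print(from_lists, from_ndarray, sep="\n\n")
```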

In [100]:
# Create DF

from numpy.random import randn as rn
# np.random.seed(100)

matrix_data = rn(5,4)
# print(matrix_data)

row_labels = ['A','B','C','D','E']
column_headings = ['W','X','Y','Z']

df = pd.DataFrame(data=matrix_data, index=row_labels, columns=column_headings)

print("\nThe data frame looks like\n",'-'*45, sep='')
df


The data frame looks like
---------------------------------------------


Unnamed: 0,W,X,Y,Z
A,1.120351,-1.294511,-0.084506,1.826561
B,0.662123,-0.348818,1.773392,-0.903199
C,0.395988,-2.237833,0.26004,0.931202
D,-1.912226,0.564691,-0.41731,-0.158989
E,-0.62128,-0.126943,-1.110478,-0.262472


In [101]:
# Access Column 
## df.X (NOT RECOMMENDED)
## df[col_name]

print("\nThe 'X' and 'Z' columns indexed by passing a list\n",'-'*55, sep='')
print(df[['X','Z']])

#col type
print("\nType of the column: ", type(df['X']), sep='')

# for more than one column, the object turns into a DataFrame
print("\nType of the pair of columns: ", type(df[['X','Z']]), sep='')
df[['X','Z']]


The 'X' and 'Z' columns indexed by passing a list
-------------------------------------------------------
          X         Z
A -1.294511  1.826561
B -0.348818 -0.903199
C -2.237833  0.931202
D  0.564691 -0.158989
E -0.126943 -0.262472

Type of the column: <class 'pandas.core.series.Series'>

Type of the pair of columns: <class 'pandas.core.frame.DataFrame'>


Unnamed: 0,X,Z
A,-1.294511,1.826561
B,-0.348818,-0.903199
C,-2.237833,0.931202
D,0.564691,-0.158989
E,-0.126943,-0.262472


In [499]:
# drop column
#del df[colname]

print("\nA column is dropped by using df.drop() method\n",'-'*55, sep='')
df = df.drop('W', axis=1) # Notice the axis=1 option, axis = 0 is default, so one has to change it to 1
df


A column is dropped by using df.drop() method
-------------------------------------------------------


Unnamed: 0,X,Y,Z
A,-0.668172,0.007315,-0.612939
B,-1.733096,-0.98331,0.357508
C,1.470714,-1.188018,-0.549746
D,-0.827932,0.108863,0.50781
E,1.24947,-0.079611,-0.889731


In [502]:
# recreate df
df = pd.DataFrame(data=matrix_data, index=row_labels, columns=column_headings)
df

Unnamed: 0,W,X,Y,Z
A,-0.544439,-0.668172,0.007315,-0.612939
B,1.299748,-1.733096,-0.98331,0.357508
C,-1.613579,1.470714,-1.188018,-0.549746
D,-0.940046,-0.827932,0.108863,0.50781
E,-0.862227,1.24947,-0.079611,-0.889731


In [501]:
# df1=df.drop('A')
df1=df.drop('A', axis = 0)
print("\nA row (index) is dropped by using df.drop() method and axis=0\n",'-'*65, sep='')
print(df1)



A row (index) is dropped by using df.drop() method and axis=0
-----------------------------------------------------------------
          W         X         Y         Z
B  1.299748 -1.733096 -0.983310  0.357508
C -1.613579  1.470714 -1.188018 -0.549746
D -0.940046 -0.827932  0.108863  0.507810
E -0.862227  1.249470 -0.079611 -0.889731


### Selecting/indexing Rows
* Label-based 'loc' method
* Position-based (integer) 'iloc' method

In [503]:
print(df)
print("\nLabel-based 'loc' method can be used for selecting row(s)\n",'-'*60, sep='')

print("\nSingle row\n")
#df[colname]
print(df.loc['C'],"\n")
print(df.iloc[2])
print('-'*60, sep='')

print("\nMultiple rows\n")
print(df.loc[['B','C']],"\n")
print(df.iloc[[1,2]])

          W         X         Y         Z
A -0.544439 -0.668172  0.007315 -0.612939
B  1.299748 -1.733096 -0.983310  0.357508
C -1.613579  1.470714 -1.188018 -0.549746
D -0.940046 -0.827932  0.108863  0.507810
E -0.862227  1.249470 -0.079611 -0.889731

Label-based 'loc' method can be used for selecting row(s)
------------------------------------------------------------

Single row

W   -1.613579
X    1.470714
Y   -1.188018
Z   -0.549746
Name: C, dtype: float64 

W   -1.613579
X    1.470714
Y   -1.188018
Z   -0.549746
Name: C, dtype: float64
------------------------------------------------------------

Multiple rows

          W         X         Y         Z
B  1.299748 -1.733096 -0.983310  0.357508
C -1.613579  1.470714 -1.188018 -0.549746 

          W         X         Y         Z
B  1.299748 -1.733096 -0.983310  0.357508
C -1.613579  1.470714 -1.188018 -0.549746
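
One difference between the two is easy to miss when slicing: a `loc` slice includes its end label, while an `iloc` slice stops before its end position. A minimal sketch on a made-up frame:

```python
import pandas as pd

df_demo = pd.DataFrame({"W": [1, 2, 3, 4]}, index=["A", "B", "C", "D"])

by_label = df_demo.loc["A":"C"]      # label slice includes the end label 'C'
by_position = df_demo.iloc[0:2]      # position slice stops before position 2

print(by_label)
print(by_position)
```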


### Subsetting DataFrame

In [504]:
matrix_data = rn(5,4)
row_labels = ['A','B','C','D','E']
column_headings = ['W','X','Y','Z']
df = pd.DataFrame(data=matrix_data, index=row_labels, columns=column_headings)
df

Unnamed: 0,W,X,Y,Z
A,-0.881798,0.018639,0.237845,0.013549
B,-1.635529,-1.04421,0.613039,0.736205
C,1.026921,-1.432191,-1.841188,0.366093
D,-0.331777,-0.689218,2.034608,-0.550714
E,0.750453,-1.306992,0.580573,-1.104523


In [506]:
print("\nElement at row 'B' and column 'Y' is") 
print(df.loc['B','Y'])
print(df.iloc[1,2])

# fetch sub dataframe
print("\nSubset comprising of rows B and D, and columns W and Y, is")
print(df.loc[ ['B','D'], ['W','Y'] ])


Element at row 'B' and column 'Y' is
0.6130388816875463
0.6130388816875463

Subset comprising of rows B and D, and columns W and Y, is
          W         Y
B -1.635529  0.613039
D -0.331777  2.034608


In [103]:
# Boolean DataFrame(s)

print(df,"\n\n")
booldf = df>0
print(booldf,"\n\n")

print("\nDataFrame indexed by boolean dataframe\n",'-'*45, sep='')
print(df[booldf])


# print(df,"\n\n")
# #booldf = df>0
# lam = lambda a: df>0 
# print(lam,"\n\n")

# print("\nDataFrame indexed by boolean dataframe\n",'-'*45, sep='')
# print(df[lam])

          W         X         Y         Z
A  1.120351 -1.294511 -0.084506  1.826561
B  0.662123 -0.348818  1.773392 -0.903199
C  0.395988 -2.237833  0.260040  0.931202
D -1.912226  0.564691 -0.417310 -0.158989
E -0.621280 -0.126943 -1.110478 -0.262472 


       W      X      Y      Z
A   True  False  False   True
B   True  False   True  False
C   True  False   True   True
D  False   True  False  False
E  False  False  False  False 



DataFrame indexed by boolean dataframe
---------------------------------------------
          W         X         Y         Z
A  1.120351       NaN       NaN  1.826561
B  0.662123       NaN  1.773392       NaN
C  0.395988       NaN  0.260040  0.931202
D       NaN  0.564691       NaN       NaN
E       NaN       NaN       NaN       NaN
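
Indexing a DataFrame with a boolean DataFrame keeps the shape and blanks out the cells where the mask is `False`. As a hedged sketch (the frame `demo` is made up for illustration), the same result can be spelled explicitly with `DataFrame.where`, which also accepts a replacement value.

```python
import numpy as np
import pandas as pd

# Illustrative data, not the random frame printed above
demo = pd.DataFrame({"W": [1.5, -0.5], "X": [-2.0, 3.0]}, index=["A", "B"])
mask = demo > 0

masked = demo[mask]              # cells where the mask is False become NaN
same = demo.where(mask)          # explicit spelling of the same operation
filled = demo.where(mask, 0.0)   # substitute 0.0 instead of NaN

print(masked)
print(filled)
```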


In [512]:
df

          W         X         Y         Z
A -0.881798  0.018639  0.237845  0.013549
B -1.635529 -1.044210  0.613039  0.736205
C  1.026921 -1.432191 -1.841188  0.366093
D -0.331777 -0.689218  2.034608 -0.550714
E  0.750453 -1.306992  0.580573 -1.104523


In [104]:
print(df['W']>0.2)
print("\nRows with W > 0.2\n",'-'*35, sep='')
print(df[ df['W']>0.2    ])


# print("\nRows with W > 0.2\n",'-'*35, sep='')
# col = df['W']>0.2
# print(col)
# lst = df[ col ]
# print(lst)



A     True
B     True
C     True
D    False
E    False
Name: W, dtype: bool

Rows with W > 0.2
-----------------------------------
          W         X         Y         Z
A  1.120351 -1.294511 -0.084506  1.826561
B  0.662123 -0.348818  1.773392 -0.903199
C  0.395988 -2.237833  0.260040  0.931202
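
Row filters like the one above can be combined. The sketch below, on an invented frame `demo`, shows the element-wise operators `&` (and) and `|` (or); the column names follow this section, the values are assumptions.

```python
import pandas as pd

# Illustrative values only
demo = pd.DataFrame(
    {"W": [1.0, 0.5, -1.0, 0.3], "Y": [-0.2, 1.5, 2.0, 0.1]},
    index=["A", "B", "C", "D"],
)

# Each comparison needs its own parentheses because & and | bind tighter than >
both = demo[(demo["W"] > 0.2) & (demo["Y"] > 0)]
either = demo[(demo["W"] > 0.2) | (demo["Y"] > 0)]

print(list(both.index))
print(list(either.index))
```

Python's `and`/`or` keywords do not work here because they ask for a single truth value of the whole Series.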


### Multi-indexing ( Hierarchical indexing )

The MultiIndex object is the hierarchical analogue of the standard Index object which typically stores the axis labels in pandas objects. You can think of MultiIndex as an array of tuples where each tuple is unique. A MultiIndex can be created from a list of arrays (using MultiIndex.from_arrays), an array of tuples (using MultiIndex.from_tuples), or a crossed set of iterables (using MultiIndex.from_product). The Index constructor will attempt to return a MultiIndex when it is passed a list of tuples.
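
As a rough illustration of the constructors named above, the sketch below builds the same two-level index three ways and checks that the results agree. The lists mirror the `outside` and `inside` lists used in this section; the other variable names are only for this example.

```python
import pandas as pd

outside = ["G1", "G1", "G1", "G2", "G2", "G2"]
inside = [1, 2, 3, 1, 2, 3]

# One array per level
from_arrays = pd.MultiIndex.from_arrays([outside, inside])
# One tuple per row
from_tuples = pd.MultiIndex.from_tuples(list(zip(outside, inside)))
# Every combination of the two iterables
from_product = pd.MultiIndex.from_product([["G1", "G2"], [1, 2, 3]])
# The plain Index constructor also returns a MultiIndex for a list of tuples
via_index = pd.Index(list(zip(outside, inside)))

print(from_arrays.equals(from_tuples), from_tuples.equals(from_product))
print(type(via_index).__name__, from_product.nlevels, len(from_product))
```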

In [106]:
# Index Levels

outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
print(list(zip(outside,inside)))
hier_index = list(zip(outside,inside))

print("\nTuple pairs after the zip and list command\n",'-'*45, sep='')
hier_index

[('G1', 1), ('G1', 2), ('G1', 3), ('G2', 1), ('G2', 2), ('G2', 3)]

Tuple pairs after the zip and list command
---------------------------------------------


[('G1', 1), ('G1', 2), ('G1', 3), ('G2', 1), ('G2', 2), ('G2', 3)]

In [520]:
# creating Multi-Index

hier_index = pd.MultiIndex.from_tuples(hier_index)
print("\nIndex hierarchy\n",'-'*25, sep='')
print(hier_index,"\n\n")
print(type(hier_index))


Index hierarchy
-------------------------
MultiIndex(levels=[['G1', 'G2'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]]) 


<class 'pandas.core.indexes.multi.MultiIndex'>


In [521]:
print("\nCreating DataFrame with multi-index\n",'-'*37, sep='')

data=np.random.randint(1,10,(6,3))
# data=np.random.randint(1,10,(5,3)) # would raise a ValueError: 5 rows cannot match the 6 index entries

print(data,"\n\n")
df1 = pd.DataFrame(data, index= hier_index, columns= ['A','B','C'])
print(df1)


Creating DataFrame with multi-index
-------------------------------------
[[8 3 2]
 [3 8 2]
 [1 6 4]
 [6 3 7]
 [2 2 6]
 [3 6 7]] 


      A  B  C
G1 1  8  3  2
   2  3  8  2
   3  1  6  4
G2 1  6  3  7
   2  2  2  6
   3  3  6  7


In [522]:
print("\nNaming the index levels via the 'index.names' attribute\n",'-'*45, sep='')
df1.index.names=['Outer', 'Inner']
print(df1)


Naming the index levels via the 'index.names' attribute
---------------------------------------------
             A  B  C
Outer Inner         
G1    1      8  3  2
      2      3  8  2
      3      1  6  4
G2    1      6  3  7
      2      2  2  6
      3      3  6  7


In [101]:
# xs() takes a cross-section of a MultiIndex, much like reading one slice of a pivot table

In [523]:
print("\nGrabbing a cross-section from outer level\n",'-'*45, sep='')
print(df1.xs('G1'))



Grabbing a cross-section from outer level
---------------------------------------------
       A  B  C
Inner         
1      8  3  2
2      3  8  2
3      1  6  4


In [525]:
print("\nGrabbing a cross-section from inner level (for all outer levels)\n",'-'*65, sep='')
print(df1.xs(2,level='Inner'))


Grabbing a cross-section from inner level (for all outer levels)
-----------------------------------------------------------------
       A  B  C
Outer         
G1     3  8  2
G2     2  2  6


---

# TODO

### Q1 Read a TSV file

In [46]:
import pandas as pd

url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv'
    
df = pd.read_csv(url, sep = '\t')

# show the rows in reverse order (last record first)
df.iloc[::-1]
# df.head()

      order_id  quantity            item_name                                 choice_description  item_price
4621      1834         1   Chicken Salad Bowl  [Fresh Tomato Salsa, [Fajita Vegetables, Pinto...       $8.75
4620      1834         1   Chicken Salad Bowl  [Fresh Tomato Salsa, [Fajita Vegetables, Lettu...       $8.75
4619      1834         1   Chicken Salad Bowl  [Fresh Tomato Salsa, [Fajita Vegetables, Pinto...      $11.25
4618      1833         1        Steak Burrito  [Fresh Tomato Salsa, [Rice, Sour Cream, Cheese...      $11.75
4617      1833         1        Steak Burrito  [Fresh Tomato Salsa, [Rice, Black Beans, Sour ...      $11.75
4616      1832         1  Chips and Guacamole                                                NaN       $4.45
4615      1832         1   Chicken Soft Tacos   [Fresh Tomato Salsa, [Rice, Cheese, Sour Cream]]       $8.75
4614      1831         1        Bottled Water                                                NaN       $1.50
4613      1831         1                Chips                                                NaN       $2.15
4612      1831         1        Carnitas Bowl  [Fresh Tomato Salsa, [Fajita Vegetables, Rice,...       $9.25


### Q2 Print first 5 and last 7 records

In [8]:
df.head(5)

   order_id  quantity                              item_name                                 choice_description  item_price
0         1         1           Chips and Fresh Tomato Salsa                                                NaN       $2.39
1         1         1                                   Izze                                       [Clementine]       $3.39
2         1         1                       Nantucket Nectar                                            [Apple]       $3.39
3         1         1  Chips and Tomatillo-Green Chili Salsa                                                NaN       $2.39
4         2         2                           Chicken Bowl  [Tomatillo-Red Chili Salsa (Hot), [Black Beans...      $16.98


In [5]:
df.tail(7)

      order_id  quantity            item_name                                 choice_description  item_price
4615      1832         1   Chicken Soft Tacos   [Fresh Tomato Salsa, [Rice, Cheese, Sour Cream]]       $8.75
4616      1832         1  Chips and Guacamole                                                NaN       $4.45
4617      1833         1        Steak Burrito  [Fresh Tomato Salsa, [Rice, Black Beans, Sour ...      $11.75
4618      1833         1        Steak Burrito  [Fresh Tomato Salsa, [Rice, Sour Cream, Cheese...      $11.75
4619      1834         1   Chicken Salad Bowl  [Fresh Tomato Salsa, [Fajita Vegetables, Pinto...      $11.25
4620      1834         1   Chicken Salad Bowl  [Fresh Tomato Salsa, [Fajita Vegetables, Lettu...       $8.75
4621      1834         1   Chicken Salad Bowl  [Fresh Tomato Salsa, [Fajita Vegetables, Pinto...       $8.75


### Q3 Print total records and type of variables

In [8]:
print("Type of variables:")
df.info()   # info() prints its report itself and returns None
print("-"*50)
print("Total Records: ",df.shape[0])

Type of variables:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4622 entries, 0 to 4621
Data columns (total 5 columns):
order_id              4622 non-null int64
quantity              4622 non-null int64
item_name             4622 non-null object
choice_description    3376 non-null object
item_price            4622 non-null object
dtypes: int64(2), object(3)
memory usage: 180.6+ KB
--------------------------------------------------
Total Records:  4622
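
`DataFrame.info()` writes its report to standard output (or to a buffer passed as `buf`) and returns `None`, so wrapping it in `print()` only adds a stray `None`. A small sketch with a made-up three-row frame shows this, along with `dtypes` as a value that can be printed or stored.

```python
import io
import pandas as pd

# Tiny stand-in for the real dataset; rows are illustrative
demo = pd.DataFrame(
    {"order_id": [1, 1, 2], "item_price": ["$2.39 ", "$3.39 ", "$16.98 "]}
)

buffer = io.StringIO()
result = demo.info(buf=buffer)    # the report goes to the buffer, not the return value

print(result)                     # the return value of info()
print(demo.dtypes)                # a Series, so it can be printed or stored
print(len(demo), demo.shape[0])   # two equivalent ways to count rows
```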


### Q4 Print the names of all the columns and show how the dataset is indexed

In [9]:
print(df.columns,"\n\n")
print(df.index)

Index(['order_id', 'quantity', 'item_name', 'choice_description',
       'item_price'],
      dtype='object') 


RangeIndex(start=0, stop=4622, step=1)


### Q5 Which was the most-ordered item, and how many of it were ordered?

In [47]:
# total quantity ordered per item, largest first
# (selecting 'quantity' before sum() avoids summing the text columns)
c = df.groupby("item_name")["quantity"].sum()
c = c.sort_values(ascending=False)
print(c.head(1))

item_name
Chicken Bowl    761
Name: quantity, dtype: int64
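
When only the top entry is needed, sorting the whole result is optional. The sketch below, on a few invented order rows (not the Chipotle data), uses `idxmax` and `max` on the grouped totals instead.

```python
import pandas as pd

# Made-up rows for illustration
orders = pd.DataFrame({
    "item_name": ["Chicken Bowl", "Izze", "Chicken Bowl", "Chips"],
    "quantity": [2, 1, 3, 1],
})

totals = orders.groupby("item_name")["quantity"].sum()
top_item = totals.idxmax()     # label of the largest total
top_quantity = totals.max()    # the largest total itself

print(top_item, top_quantity)
```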


### Q6 What was the most ordered item in the choice_description column?

In [540]:
# sum the numeric columns per choice_description, largest quantity first

c = df.groupby('choice_description')[['order_id', 'quantity']].sum()
c = c.sort_values(['quantity'], ascending=False)
print(c.head(10))
c.head(1)['quantity']

                                                    order_id  quantity
choice_description                                                    
[Diet Coke]                                           123455       159
[Coke]                                                122752       143
[Sprite]                                               80426        89
[Fresh Tomato Salsa, [Rice, Black Beans, Cheese...     43088        49
[Fresh Tomato Salsa, [Rice, Black Beans, Cheese...     36041        42
[Fresh Tomato Salsa, [Rice, Black Beans, Cheese...     37550        40
[Lemonade]                                             31892        36
[Fresh Tomato Salsa (Mild), [Pinto Beans, Rice,...     24432        36
[Coca Cola]                                            19282        32
[Fresh Tomato Salsa, [Rice, Cheese, Sour Cream,...     29614        30


choice_description
[Diet Coke]    159
Name: quantity, dtype: int64
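
"Most ordered" can mean most order lines or most units. The following sketch, with invented rows, contrasts `value_counts` (rows per description) with a grouped sum of `quantity` (units per description); both skip missing descriptions by default.

```python
import pandas as pd

# Illustrative rows; None stands for a missing choice_description
orders = pd.DataFrame({
    "choice_description": ["[Coke]", "[Coke]", "[Coke]", "[Sprite]", None],
    "quantity": [1, 1, 1, 5, 2],
})

by_rows = orders["choice_description"].value_counts()
by_quantity = orders.groupby("choice_description")["quantity"].sum()

print(by_rows.idxmax(), by_quantity.idxmax())
```

Here the two readings disagree, so it is worth stating which one a result refers to.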

### Q7 Turn the item price into a float

In [60]:
# reload the DataFrame so that item_price is a string column again
import numpy as np
df = pd.read_csv(url, sep = '\t')

print("-"*80)


print("Before\n\n",df.dtypes,"\n\n")
print("-"*80)
# drop the leading '$' from every price, e.g. '$2.39 ' -> '2.39 '
price=np.array([item[1:] for item in df["item_price"]])
print(price)
df["item_price"]=price
# the float conversion tolerates the trailing space left in each string
df["item_price"]=df["item_price"].astype('float64')
print(df.head())
# alternative: x[1:-1] removes both the '$' and the trailing space
# dollar = lambda x: float(x[1:-1])
# df.item_price = df.item_price.apply(dollar)


print("After\n\n",df.dtypes)

--------------------------------------------------------------------------------
Before

 order_id               int64
quantity               int64
item_name             object
choice_description    object
item_price            object
dtype: object 


--------------------------------------------------------------------------------
['2.39 ' '3.39 ' '3.39 ' ... '11.25 ' '8.75 ' '8.75 ']
   order_id  quantity                              item_name  \
0         1         1           Chips and Fresh Tomato Salsa   
1         1         1                                   Izze   
2         1         1                       Nantucket Nectar   
3         1         1  Chips and Tomatillo-Green Chili Salsa   
4         2         2                           Chicken Bowl   

                                  choice_description  item_price  
0                                                NaN        2.39  
1                                       [Clementine]        3.39  
2                         
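
The same conversion can also be written with pandas string methods instead of a list comprehension. This is a sketch under the assumption that every price looks like the strings printed above (a `$`, digits, and a trailing space); the three sample values are copied from this document.

```python
import pandas as pd

prices = pd.Series(["$2.39 ", "$3.39 ", "$16.98 "], name="item_price")

# Remove the literal '$', trim whitespace, then convert to float
as_float = prices.str.replace("$", "", regex=False).str.strip().astype("float64")

# pd.to_numeric is an alternative that can turn unparseable values into NaN
checked = pd.to_numeric(prices.str.strip("$ "), errors="coerce")

print(as_float.tolist())
print(checked.dtype)
```

Passing `regex=False` matters because `$` is a special character in regular expressions.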

### Q8 How much was the revenue for the period in the dataset?


In [61]:
# revenue = sum(quantity*item_price)

revenue = (df['quantity']* df['item_price']).sum()
print(revenue)
print('Revenue was: $' + str(np.round(revenue,2)))

39237.02
Revenue was: $39237.02
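
One caveat is worth checking before trusting this figure. In the rows shown in this document, index 4 has quantity 2 at 16.98 and index 3598 has quantity 15 at 44.25, which hints that `item_price` might already be the total for the line rather than a unit price. That is an assumption, not something the dataset states here. The sketch below uses only those two rows to show how far apart the two readings are.

```python
import pandas as pd

# Two rows copied from the outputs in this document
sample = pd.DataFrame({
    "item_name": ["Chicken Bowl", "Chips and Fresh Tomato Salsa"],
    "quantity": [2, 15],
    "item_price": [16.98, 44.25],
})

# Reading 1 (used above): item_price is a unit price
as_unit_price = (sample["quantity"] * sample["item_price"]).sum()

# Reading 2 (an assumption): item_price is already the line total
as_line_total = sample["item_price"].sum()

# Unit prices implied by reading 2
implied_unit = (sample["item_price"] / sample["quantity"]).round(2)

print(round(as_unit_price, 2), round(as_line_total, 2))
print(implied_unit.tolist())
```

If the implied unit prices match the menu prices seen for single-quantity rows, reading 2 is the more plausible one and revenue would be the plain sum of `item_price`.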


### Q9 Print a data frame with only two columns, item_name and item_price

In [69]:
print(df.head())
# drop rows that repeat the same (item_name, quantity) pair
filtered = df.drop_duplicates(['item_name','quantity'])
print(100*"-")
print(filtered.head())
# keep only the rows where quantity equals 1
one_prod = filtered[filtered.quantity == 1]

# select only the item_name and item_price columns
price_per_item = one_prod[['item_name', 'item_price']]

# sort from most to least expensive
price_per_item.sort_values(by = "item_price", ascending = False).head()

   order_id  quantity                              item_name  \
0         1         1           Chips and Fresh Tomato Salsa   
1         1         1                                   Izze   
2         1         1                       Nantucket Nectar   
3         1         1  Chips and Tomatillo-Green Chili Salsa   
4         2         2                           Chicken Bowl   

                                  choice_description  item_price  
0                                                NaN        2.39  
1                                       [Clementine]        3.39  
2                                            [Apple]        3.39  
3                                                NaN        2.39  
4  [Tomatillo-Red Chili Salsa (Hot), [Black Beans...       16.98  
----------------------------------------------------------------------------------------------------
   order_id  quantity                              item_name  \
0         1         1           Chips and Fresh 

                  item_name  item_price
606        Steak Salad Bowl       11.89
1229    Barbacoa Salad Bowl       11.89
1132    Carnitas Salad Bowl       11.89
7             Steak Burrito       11.75
168   Barbacoa Crispy Tacos       11.75


### Q10 What was the quantity of the most expensive item ordered?

In [377]:
df.sort_values(by = "item_price", ascending = False).head(1)

      order_id  quantity                     item_name  choice_description  item_price
3598      1443        15  Chips and Fresh Tomato Salsa                 NaN       44.25
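
Sorting the whole frame works, but the row with the highest price can also be looked up directly. The sketch below uses three illustrative rows (the middle one copied from the output above) with `idxmax` and `nlargest`.

```python
import pandas as pd

orders = pd.DataFrame({
    "item_name": ["Steak Burrito", "Chips and Fresh Tomato Salsa", "Chicken Bowl"],
    "quantity": [1, 15, 2],
    "item_price": [11.75, 44.25, 16.98],
})

top_label = orders["item_price"].idxmax()    # index label of the highest price
top_row = orders.loc[top_label]              # that row as a Series
largest = orders.nlargest(1, "item_price")   # same row, kept as a DataFrame

print(top_row["quantity"])
print(largest)
```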


### Q11 How many times was a Veggie Salad Bowl ordered?

In [73]:
#df[df.item_name == "Veggie Salad Bowl"].head()
df[df.item_name == "Veggie Salad Bowl"].count()

order_id              18
quantity              18
item_name             18
choice_description    18
item_price            18
dtype: int64
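
`count()` reports non-null values per column, so its numbers can differ between columns when data is missing. A brief sketch on invented rows shows three related answers: the number of matching rows, the per-column non-null counts, and the total units ordered.

```python
import pandas as pd

# Illustrative rows; None marks a missing description
orders = pd.DataFrame({
    "item_name": ["Veggie Salad Bowl", "Chicken Bowl", "Veggie Salad Bowl"],
    "choice_description": ["[Salsa]", None, None],
    "quantity": [1, 2, 3],
})

veggie = orders[orders["item_name"] == "Veggie Salad Bowl"]

row_count = len(veggie)              # number of order lines
non_null = veggie.count()            # non-null values in each column
units = veggie["quantity"].sum()     # total bowls across those lines

print(row_count, units)
print(non_null)
```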