In [1]:
import pandas as pd

Pandas is a Python library for working with tabular data (i.e., data in a table with rows and columns). It offers much of the same functionality as SQL or Excel, combined with the power of Python.

Pandas is built on DataFrames.
A DataFrame is an object that stores data as rows and columns. You can think of a DataFrame as a spreadsheet or as a SQL table. Each column has a name, which is usually a string. Each row has an index, which by default is an integer. DataFrames can contain many different data types: strings, ints, floats, tuples, etc. You can manually create a DataFrame or fill it with data from a CSV, an Excel spreadsheet, or a SQL query.
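To make the description above concrete, here is a minimal sketch that inspects the column names, the row index, and the per-column data types of a small DataFrame. The frame toy and its values are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical toy data, not part of the original notebook.
toy = pd.DataFrame({
    "name": ["Ann", "Bob"],       # strings
    "age": [31, 45],              # integers
    "height_m": [1.62, 1.80],     # floats
})

column_names = list(toy.columns)  # column labels
row_labels = list(toy.index)      # default index counts up from 0
column_dtypes = toy.dtypes        # one dtype per column

print(column_names)
print(row_labels)
print(column_dtypes)
```

Each column carries a single dtype. How string columns are reported depends on the pandas version, so the exact dtype printed for name may differ between installations.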


In [3]:
# Create a DataFrame from a dictionary (keys become column names)
df1 = pd.DataFrame({
  'Product ID': [1, 2, 3, 4],
  'Product Name': ['t-shirt', 't-shirt', 'skirt', 'shirt'],
  'Color': ['blue', 'green', 'red', 'black']
})

print(df1)

   Product ID Product Name  Color
0           1      t-shirt   blue
1           2      t-shirt  green
2           3        skirt    red
3           4        shirt  black


In [4]:
# Create a DataFrame from a list of rows, naming the columns explicitly
df2 = pd.DataFrame([
  [1, 'San Diego', 100],
  [2, 'Los Angeles', 120],
  [3, 'San Francisco', 90],
  [4, 'Sacramento', 115]
],
  columns = [
    'Store ID', 'Location', 'Number of Employees'
  ])

print(df2)

   Store ID       Location  Number of Employees
0         1      San Diego                  100
1         2    Los Angeles                  120
2         3  San Francisco                   90
3         4     Sacramento                  115


Most of the time we will be working with datasets loaded from external files such as CSV, Excel, or Feather.

In [5]:
#Read csv
df = pd.read_csv('world_ind_pop_data.csv')
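The call above assumes that world_ind_pop_data.csv sits in the working directory. As a self-contained sketch of the same idea, the snippet below reads CSV text from an in-memory buffer instead of a file. The three-row sample and the name sample are illustrative stand-ins, not the full dataset.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for a CSV file on disk.
csv_text = """CountryName,CountryCode,Year,Total Population
Arab World,ARB,1960,92495900.0
Caribbean small states,CSS,1960,4190810.0
Zambia,ZMB,2014,15721343.0
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (number of rows, number of columns)
print(sample.head(2))  # first two rows
```

The first line of the text is used as the header row, and pandas infers a dtype for every column from the values it reads.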

In [6]:
#Print first 5 rows
df.head()

                               CountryName  CountryCode  Year  Total Population  Urban population (% of total)
0                               Arab World          ARB  1960        92495900.0                      31.285384
1                   Caribbean small states          CSS  1960         4190810.0                       31.59749
2           Central Europe and the Baltics          CEB  1960        91401580.0                      44.507921
3  East Asia & Pacific (all income levels)          EAS  1960      1042475000.0                      22.471132
4    East Asia & Pacific (developing only)          EAP  1960       896493000.0                      16.917679


In [7]:
#last 5 rows
df.tail()

                 CountryName  CountryCode  Year  Total Population  Urban population (% of total)
13369  Virgin Islands (U.S.)          VIR  2014          104170.0                         95.203
13370     West Bank and Gaza          WBG  2014         4294682.0                         75.026
13371            Yemen, Rep.          YEM  2014        26183676.0                         34.027
13372                 Zambia          ZMB  2014        15721343.0                         40.472
13373               Zimbabwe          ZWE  2014        15245855.0                         32.501


In [8]:
#info about the dataframe created
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13374 entries, 0 to 13373
Data columns (total 5 columns):
CountryName                      13374 non-null object
CountryCode                      13374 non-null object
Year                             13374 non-null int64
Total Population                 13374 non-null float64
Urban population (% of total)    13374 non-null float64
dtypes: float64(2), int64(1), object(2)
memory usage: 522.5+ KB


A DataFrame is made of rows and columns. Each column is a Series, a one-dimensional labeled array.

In [9]:
#Selecting a column
CountryName = df.CountryName #Can also use df["CountryName"]

In [11]:
CountryName.head()

0                                 Arab World
1                     Caribbean small states
2             Central Europe and the Baltics
3    East Asia & Pacific (all income levels)
4      East Asia & Pacific (developing only)
Name: CountryName, dtype: object

In [13]:
# Select several columns by passing a list of names; the result is a new DataFrame
country_info = df[["CountryName","CountryCode"]]
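The difference between single and double brackets is easy to miss, so here is a small sketch of it. The frame stores is a hypothetical two-row example used only for illustration.

```python
import pandas as pd

# Hypothetical two-row frame used only for illustration.
stores = pd.DataFrame({
    "Store ID": [1, 2],
    "Location": ["San Diego", "Los Angeles"],
    "Number of Employees": [100, 120],
})

one_column = stores["Location"]               # single brackets -> Series
sub_frame = stores[["Store ID", "Location"]]  # list of names -> DataFrame
single_col_frame = stores[["Location"]]       # one-item list -> one-column DataFrame

print(type(one_column).__name__)
print(type(sub_frame).__name__)
print(single_col_frame.shape)
```

Attribute access such as stores.Location only works when the column name is a valid Python identifier. A name containing spaces, like "Number of Employees", has to be selected with brackets.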

Selecting a single row. Like NumPy arrays, pandas DataFrames support zero-based positional indexing, here through iloc.

In [15]:
# Select the third row (position 2, since positions start at 0)
df.iloc[2]

CountryName                      Central Europe and the Baltics
CountryCode                                                 CEB
Year                                                       1960
Total Population                                    9.14016e+07
Urban population (% of total)                           44.5079
Name: 2, dtype: object

In [16]:
type(df.iloc[2])

pandas.core.series.Series
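Positional selection with iloc is often confused with label-based selection through loc. The sketch below contrasts the two on a small frame whose index labels are letters. The frame df_demo, its labels, and the visits column are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical frame with string row labels instead of the default 0, 1, 2.
df_demo = pd.DataFrame(
    {"month": ["January", "February", "March"], "visits": [100, 51, 81]},
    index=["a", "b", "c"],
)

third_row = df_demo.iloc[2]    # by position -> Series
same_row = df_demo.loc["c"]    # by label -> the same Series
first_two = df_demo.iloc[0:2]  # positional slice -> DataFrame, end excluded

print(third_row["month"])
print(list(first_two.index))
```

With the default integer index the position and the label of a row coincide, which is why iloc[2] in the cell above returns the row labeled 2.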

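Selecting rows based on logical conditions. A comparison such as df3.month == "January" produces a Boolean Series, and indexing the DataFrame with it keeps only the rows where the value is True.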

In [17]:
df3 = pd.DataFrame([
  ['January', 100, 100, 23, 100],
  ['February', 51, 45, 145, 45],
  ['March', 81, 96, 65, 96],
  ['April', 80, 80, 54, 180],
  ['May', 51, 54, 54, 154],
  ['June', 112, 109, 79, 129]],
  columns=['month', 'clinic_east',
           'clinic_north', 'clinic_south',
           'clinic_west'])

january = df3[df3.month == "January"]

In [18]:
january

     month  clinic_east  clinic_north  clinic_south  clinic_west
0  January          100           100            23          100


In [21]:
march_april = df3[(df3.month == "March") | (df3.month == "April")]
print(march_april)

   month  clinic_east  clinic_north  clinic_south  clinic_west
2  March           81            96            65           96
3  April           80            80            54          180


In [22]:
january_february_march = df3[df3.month.isin(["January","February","March"])]
print(january_february_march)

      month  clinic_east  clinic_north  clinic_south  clinic_west
0   January          100           100            23          100
1  February           51            45           145           45
2     March           81            96            65           96
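Conditions can also be combined with & (and) or negated with ~ (not), and they work on numeric columns just as on strings. The sketch below uses a reduced copy of the clinic data (first four months, two clinics). The names clinics, busy_both, and not_january are illustrative.

```python
import pandas as pd

# Reduced copy of the clinic data, limited to four months and two clinics.
clinics = pd.DataFrame({
    "month": ["January", "February", "March", "April"],
    "clinic_east": [100, 51, 81, 80],
    "clinic_north": [100, 45, 96, 80],
})

# Rows where both clinics recorded at least 80 visits.
busy_both = clinics[(clinics.clinic_east >= 80) & (clinics.clinic_north >= 80)]

# Every row except January.
not_january = clinics[~(clinics.month == "January")]

print(busy_both)
print(not_january)
```

Each condition needs its own parentheses, because & and | bind more tightly than comparison operators in Python. The keywords and, or, and not do not work element-wise on a Series.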
