In [1]:
import pandas as pd

# DataFrames I: methods, insert, NaN, sorting, and ranking

## Methods and Attributes between Series and DataFrames
- A **DataFrame** is a 2-dimensional table consisting of rows and columns.
- Pandas uses a `NaN` designation for cells that have a missing value. It is short for "not a number". Most operations on `NaN` values will produce `NaN` values.
- Like with a **Series**, Pandas assigns an index position/label to each **DataFrame** row.
- The **DataFrame** and **Series** have common and exclusive methods/attributes.
- The `hasnans` attribute exists only on a **Series**. The `columns` attribute exists only on a **DataFrame**.
- Some methods/attributes will return different types of data.
- The `info` method prints a summary of the pandas object and returns `None`.
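
The points above can be checked on a toy example. This is a minimal sketch with made-up data (the `s` and `df` objects here are illustrative, not the NBA dataset):

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0])
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# hasnans is defined on a Series but not on a DataFrame
series_has_nans = s.hasnans
frame_has_hasnans = hasattr(df, "hasnans")

# columns is defined on a DataFrame but not on a Series
frame_columns = list(df.columns)
series_has_columns = hasattr(s, "columns")

# The same attribute can return different shapes of data:
# shape is a 1-tuple for a Series and a 2-tuple for a DataFrame
series_shape, frame_shape = s.shape, df.shape

# info() prints its report and returns None
info_result = df.info()
```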

In [2]:
nba = pd.read_csv("nba.csv")
nba.head()

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0


In [3]:
s = pd.Series([1,2,3,4,5])

In [4]:
# common methods to compare series and df
nba.tail(3)

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
589,Tristan Vukcevic,Washington Wizards,F,6-10,220.0,Real Madrid,
590,Delon Wright,Washington Wizards,G,6-5,185.0,Utah,8195122.0
591,,,,,,,


In [5]:
s.index
nba.index

RangeIndex(start=0, stop=592, step=1)

In [6]:
s.values

array([1, 2, 3, 4, 5])

In [7]:
nba.values

array([['Saddiq Bey', 'Atlanta Hawks', 'F', ..., 215.0, 'Villanova',
        4556983.0],
       ['Bogdan Bogdanovic', 'Atlanta Hawks', 'G', ..., 225.0,
        'Fenerbahce', 18700000.0],
       ['Kobe Bufkin', 'Atlanta Hawks', 'G', ..., 195.0, 'Michigan',
        4094244.0],
       ...,
       ['Tristan Vukcevic', 'Washington Wizards', 'F', ..., 220.0,
        'Real Madrid', nan],
       ['Delon Wright', 'Washington Wizards', 'G', ..., 185.0, 'Utah',
        8195122.0],
       [nan, nan, nan, ..., nan, nan, nan]], dtype=object)

In [8]:
print(s.shape)
print(nba.shape)

(5,)
(592, 7)


In [9]:
print(s.dtypes)
print("\n")
print(nba.dtypes)

int64


Name         object
Team         object
Position     object
Height       object
Weight      float64
College      object
Salary      float64
dtype: object


In [10]:
print(s.hasnans)
# df doesn't have hasnans

False


In [11]:
print(nba.columns)
# a Series has no columns attribute
# row labels and column labels are both stored as Index objects

Index(['Name', 'Team', 'Position', 'Height', 'Weight', 'College', 'Salary'], dtype='object')


In [12]:
print(s.axes)
print(nba.axes)

[RangeIndex(start=0, stop=5, step=1)]
[RangeIndex(start=0, stop=592, step=1), Index(['Name', 'Team', 'Position', 'Height', 'Weight', 'College', 'Salary'], dtype='object')]


In [13]:
# info() method
print(s.info())
print("\n")
print(nba.info())

<class 'pandas.core.series.Series'>
RangeIndex: 5 entries, 0 to 4
Series name: None
Non-Null Count  Dtype
--------------  -----
5 non-null      int64
dtypes: int64(1)
memory usage: 168.0 bytes
None


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 592 entries, 0 to 591
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      591 non-null    object 
 1   Team      591 non-null    object 
 2   Position  584 non-null    object 
 3   Height    585 non-null    object 
 4   Weight    584 non-null    float64
 5   College   578 non-null    object 
 6   Salary    488 non-null    float64
dtypes: float64(2), object(5)
memory usage: 32.5+ KB
None


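## Differences between Series and DataFrames
The `sum` method returns a single value on a **Series** but a **Series** of totals on a **DataFrame**. On a **DataFrame**, `axis="index"` (or `0`, the default) sums down each column, and `axis="columns"` (or `1`) sums across each row.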

In [14]:
revenue = pd.read_csv("revenue.csv",index_col=["Date"])
revenue.head(2)

Date,New York,Los Angeles,Miami
1/1/26,985,122,499
1/2/26,738,788,534


In [15]:
s = revenue["New York"]
s.sum(axis="index")

5475

In [16]:
revenue.sum() #sum per col

New York       5475
Los Angeles    5134
Miami          5641
dtype: int64

In [17]:
display(revenue.sum(axis="columns"))
display(revenue.sum(axis="rows"))
display(revenue.sum(axis=1))
display(revenue.sum(axis=0))
display(revenue.sum(axis="columns").sum())

Date
1/1/26     1606
1/2/26     2060
1/3/26      967
1/4/26     2519
1/5/26      438
1/6/26     1935
1/7/26     1234
1/8/26     2313
1/9/26     2623
1/10/26     555
dtype: int64

New York       5475
Los Angeles    5134
Miami          5641
dtype: int64

Date
1/1/26     1606
1/2/26     2060
1/3/26      967
1/4/26     2519
1/5/26      438
1/6/26     1935
1/7/26     1234
1/8/26     2313
1/9/26     2623
1/10/26     555
dtype: int64

New York       5475
Los Angeles    5134
Miami          5641
dtype: int64

16250

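## Select one or more columns from a DataFrame, add a new column
- Attribute access such as `nba.Team` only works when the column name is a valid Python identifier (no spaces) and does not clash with an existing **DataFrame** attribute or method. Bracket access such as `nba["Team"]` always works.
- In pandas versions before 3.0, the extracted **Series** can be a view, so changes to the **Series** may affect the **DataFrame**. With Copy-on-Write (the default from pandas 3.0, per the warning below) they never will.
- Pandas will display a warning if you mutate the **Series** through chained assignment. Use `.loc` to update the **DataFrame** in a single step, or the `copy` method to create an independent duplicate.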
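
As a minimal sketch of the two safe patterns (single-step `.loc` assignment and an explicit `copy`), using a small made-up DataFrame rather than `nba.csv`:

```python
import pandas as pd

df = pd.DataFrame(
    {"Team": ["Atlanta Hawks", "Boston Celtics"], "Salary": [100, 200]}
)

# Single-step assignment through .loc updates the original DataFrame
# and does not trigger the chained-assignment warning
df.loc[0, "Team"] = "XYA"

# An explicit copy is independent: editing it leaves df untouched
teams = df["Team"].copy()
teams.iloc[0] = "changed only in the copy"
```

After this runs, `df` holds `"XYA"` in its first row while `teams` holds the other string.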

In [18]:
nba = pd.read_csv("nba.csv")


display(nba.Team.iloc[0:3])

display(nba["Team"].iloc[0:3])

0    Atlanta Hawks
1    Atlanta Hawks
2    Atlanta Hawks
Name: Team, dtype: object

0    Atlanta Hawks
1    Atlanta Hawks
2    Atlanta Hawks
Name: Team, dtype: object

In [19]:
nba["Team"].iloc[0] = "XYA"
nba.head()

You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  nba["Team"].iloc[0] = "XYA"
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  nba["Team"].iloc[0] = "XYA"


Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,XYA,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0


In [20]:
Names = nba["Team"].copy()
Names.iloc[0] = "XYAsdaf"
nba.head()  # nba is not affected because Names is an independent copy

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,XYA,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0


In [21]:
nba[["Team","College","Name"]].iloc[0:4]

Unnamed: 0,Team,College,Name
0,XYA,Villanova,Saddiq Bey
1,Atlanta Hawks,Fenerbahce,Bogdan Bogdanovic
2,Atlanta Hawks,Michigan,Kobe Bufkin
3,Atlanta Hawks,Elan Chalon,Clint Capela


In [22]:
nba["sports"] = "Basketball"
nba["salary_normalized"] = nba["Salary"].apply(lambda x: x/10 if x%2==0 else x)
nba.head()

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary,sports,salary_normalized
0,Saddiq Bey,XYA,F,6-7,215.0,Villanova,4556983.0,Basketball,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0,Basketball,1870000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0,Basketball,409424.4
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0,Basketball,2061600.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0,Basketball,258152.2


### insert a new col at an index

In [23]:
nba.insert(loc=3,column="another_salary",value=nba["Salary"].apply(lambda x: x/10 if x%4==0 else x))
nba.head()

Unnamed: 0,Name,Team,Position,another_salary,Height,Weight,College,Salary,sports,salary_normalized
0,Saddiq Bey,XYA,F,4556983.0,6-7,215.0,Villanova,4556983.0,Basketball,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,1870000.0,6-5,225.0,Fenerbahce,18700000.0,Basketball,1870000.0
2,Kobe Bufkin,Atlanta Hawks,G,409424.4,6-5,195.0,Michigan,4094244.0,Basketball,409424.4
3,Clint Capela,Atlanta Hawks,C,2061600.0,6-10,256.0,Elan Chalon,20616000.0,Basketball,2061600.0
4,Bruno Fernando,Atlanta Hawks,F-C,2581522.0,6-10,240.0,Maryland,2581522.0,Basketball,258152.2
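
A short sketch of how `insert` behaves, on made-up data: it modifies the DataFrame in place, returns `None`, and places the new column at the integer position given by `loc`.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["A", "B"], "Salary": [10.0, 20.0]})

# Place the new column at position 1, between Name and Salary.
# insert works in place, so there is nothing to assign back.
result = df.insert(loc=1, column="Sport", value="Basketball")

column_order = list(df.columns)
```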


### value_counts method in df

In [24]:
nba = pd.read_csv("nba.csv")
nba["Team"].value_counts().iloc[0:4]

Team
Dallas Mavericks    23
Miami Heat          22
Denver Nuggets      22
Milwaukee Bucks     22
Name: count, dtype: int64

In [25]:
nba["Team"].value_counts(normalize=True).iloc[0:5]

Team
Dallas Mavericks     0.038917
Miami Heat           0.037225
Denver Nuggets       0.037225
Milwaukee Bucks      0.037225
Memphis Grizzlies    0.037225
Name: proportion, dtype: float64

In [26]:
nba

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0
...,...,...,...,...,...,...,...
587,Ryan Rollins,Washington Wizards,G,6-3,180.0,Toledo,1719864.0
588,Landry Shamet,Washington Wizards,G,6-4,190.0,Wichita State,10250000.0
589,Tristan Vukcevic,Washington Wizards,F,6-10,220.0,Real Madrid,
590,Delon Wright,Washington Wizards,G,6-5,185.0,Utah,8195122.0


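## Important: Drop rows with missing values: dropna()
Key parameters:
- `how`: `"any"` (the default) drops a row if at least one value is missing; `"all"` drops it only if every value is missing.
- `subset`: restricts the check to the listed columns.
- `ignore_index`: if `True`, renumbers the remaining rows from 0.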
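
The effect of each parameter is easier to see on a tiny frame. This sketch uses invented rows and assumes pandas 2.0 or later, where `dropna` accepts `ignore_index`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Ann", "Bob", np.nan, "Dee"],
        "Salary": [10.0, np.nan, np.nan, 40.0],
    }
)

# how="all": only row 2 (every value missing) is dropped
all_missing_dropped = df.dropna(how="all")

# how="any": rows 1 and 2 are dropped because each has at least one NaN
any_missing_dropped = df.dropna(how="any")

# subset: only the Salary column is checked; ignore_index renumbers from 0
salary_present = df.dropna(subset=["Salary"], ignore_index=True)
```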

In [27]:
nba.dropna(how="all",ignore_index=True)  # drop only the rows where every value is missing

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0
...,...,...,...,...,...,...,...
586,Jordan Poole,Washington Wizards,G,6-4,194.0,Michigan,27955357.0
587,Ryan Rollins,Washington Wizards,G,6-3,180.0,Toledo,1719864.0
588,Landry Shamet,Washington Wizards,G,6-4,190.0,Wichita State,10250000.0
589,Tristan Vukcevic,Washington Wizards,F,6-10,220.0,Real Madrid,


In [28]:
nba.dropna(subset=["Salary"])

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0
...,...,...,...,...,...,...,...
585,Eugene Omoruyi,Washington Wizards,F,6-6,235.0,Oregon,559782.0
586,Jordan Poole,Washington Wizards,G,6-4,194.0,Michigan,27955357.0
587,Ryan Rollins,Washington Wizards,G,6-3,180.0,Toledo,1719864.0
588,Landry Shamet,Washington Wizards,G,6-4,190.0,Wichita State,10250000.0


In [29]:
nba.dropna(how="any",ignore_index=True)

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0
...,...,...,...,...,...,...,...
470,Eugene Omoruyi,Washington Wizards,F,6-6,235.0,Oregon,559782.0
471,Jordan Poole,Washington Wizards,G,6-4,194.0,Michigan,27955357.0
472,Ryan Rollins,Washington Wizards,G,6-3,180.0,Toledo,1719864.0
473,Landry Shamet,Washington Wizards,G,6-4,190.0,Wichita State,10250000.0


## Important: Filling missing values from rows: fillna()

In [30]:
nba = nba.dropna(how="all")

In [31]:
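# nba.fillna("NULL") would fill every column with the same value
# fillna returns a new Series; nba stays unchanged unless the result is assigned back
nba["Salary"].fillna(0)  # a numeric 0 (not the string "0") keeps the column numeric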
nba

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0
...,...,...,...,...,...,...,...
586,Jordan Poole,Washington Wizards,G,6-4,194.0,Michigan,27955357.0
587,Ryan Rollins,Washington Wizards,G,6-3,180.0,Toledo,1719864.0
588,Landry Shamet,Washington Wizards,G,6-4,190.0,Wichita State,10250000.0
589,Tristan Vukcevic,Washington Wizards,F,6-10,220.0,Real Madrid,


In [32]:
nba["College"].fillna(value="UNKNOWN")

0          Villanova
1         Fenerbahce
2           Michigan
3        Elan Chalon
4           Maryland
           ...      
586         Michigan
587           Toledo
588    Wichita State
589      Real Madrid
590             Utah
Name: College, Length: 591, dtype: object
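
Because `fillna` returns a new object, the result has to be assigned back to persist. A minimal sketch on made-up data, also showing that a dict can supply a different fill value per column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"College": ["Duke", np.nan], "Salary": [1.0, np.nan]})

# A dict maps each column to its own fill value; df itself is not modified
filled = df.fillna({"College": "UNKNOWN", "Salary": 0})

# Assign the result back to make the change stick for one column
df["Salary"] = df["Salary"].fillna(0)
```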

## the astype method: nan issue, category type
Important: when a numeric column contains any missing value, pandas stores it as `float64`, because `NaN` is itself a float.

- The `astype` method converts a **Series's** values to a specified type.
- Pass in the specified type as either a string or the core Python data type.
- Pandas cannot convert `NaN` values to integer types, so we need to eliminate/replace them before we perform the conversion.
- The `dtypes` attribute returns a **Series** with the **DataFrame's** columns and their types.

- The `category` type is ideal for columns with a limited number of unique values. Use it to reduce the memory footprint of the table.
- The `nunique` method will return a **Series** with the number of unique values in each column.
- With categories, pandas does not create a separate value in memory for each "cell". Rather, the cells point to a single copy for each unique value.
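
Both points can be sketched on invented data: the integer conversion fails while `NaN` is present and succeeds after `fillna`, and a repetitive text column takes less memory as a `category`. The exact byte counts depend on the pandas version, so only the comparison is shown.

```python
import numpy as np
import pandas as pd

salary = pd.Series([4556983.0, np.nan, 8195122.0])

# Converting NaN to an integer type raises an error (a ValueError subclass)
try:
    salary.astype(int)
    conversion_failed = False
except ValueError:
    conversion_failed = True

# Replace the missing value first, then convert
as_int = salary.fillna(0).astype(int)

# A column with few unique values repeated many times
teams = pd.Series(["Hawks", "Celtics"] * 1000)
text_bytes = teams.memory_usage(deep=True)
category_bytes = teams.astype("category").memory_usage(deep=True)
```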

In [33]:
nba2 = nba.copy()

In [34]:
nba2 = nba2.dropna(how="all")
nba2["Salary"] = nba["Salary"].fillna(0)
nba2

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0
...,...,...,...,...,...,...,...
586,Jordan Poole,Washington Wizards,G,6-4,194.0,Michigan,27955357.0
587,Ryan Rollins,Washington Wizards,G,6-3,180.0,Toledo,1719864.0
588,Landry Shamet,Washington Wizards,G,6-4,190.0,Wichita State,10250000.0
589,Tristan Vukcevic,Washington Wizards,F,6-10,220.0,Real Madrid,0.0


In [35]:
nba.dtypes

Name         object
Team         object
Position     object
Height       object
Weight      float64
College      object
Salary      float64
dtype: object

In [36]:
nba2["Salary"].astype("int")

0       4556983
1      18700000
2       4094244
3      20616000
4       2581522
         ...   
586    27955357
587     1719864
588    10250000
589           0
590     8195122
Name: Salary, Length: 591, dtype: int64

In [37]:
# a column with a small set of repeated values is a good candidate for the category dtype

nba["Team"].value_counts()

Team
Dallas Mavericks          23
Miami Heat                22
Denver Nuggets            22
Milwaukee Bucks           22
Memphis Grizzlies         22
Indiana Pacers            21
Utah Jazz                 21
Toronto Raptors           21
Philadelphia 76ers        21
Oklahoma City Thunder     21
New York Knicks           21
Washington Wizards        21
Phoenix Suns              20
Houston Rockets           20
Charlotte Hornets         20
San Antonio Spurs         20
Los Angeles Clippers      19
Minnesota Timberwolves    19
Detroit Pistons           19
Cleveland Cavaliers       19
Los Angeles Lakers        19
Chicago Bulls             19
Sacramento Kings          18
Orlando Magic             18
Boston Celtics            18
Atlanta Hawks             18
Portland Trail Blazers    17
Golden State Warriors     17
Brooklyn Nets             17
New Orleans Pelicans      16
Name: count, dtype: int64

In [38]:
# get counts of uniques
nba.nunique()

Name        591
Team         30
Position      7
Height       20
Weight       93
College     182
Salary      298
dtype: int64

In [39]:
nba.info()

#memory usage

<class 'pandas.core.frame.DataFrame'>
Index: 591 entries, 0 to 590
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      591 non-null    object 
 1   Team      591 non-null    object 
 2   Position  584 non-null    object 
 3   Height    585 non-null    object 
 4   Weight    584 non-null    float64
 5   College   578 non-null    object 
 6   Salary    488 non-null    float64
dtypes: float64(2), object(5)
memory usage: 36.9+ KB


In [40]:
nba["Position"] = nba["Position"].astype("category")
nba.info()

<class 'pandas.core.frame.DataFrame'>
Index: 591 entries, 0 to 590
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   Name      591 non-null    object  
 1   Team      591 non-null    object  
 2   Position  584 non-null    category
 3   Height    585 non-null    object  
 4   Weight    584 non-null    float64 
 5   College   578 non-null    object  
 6   Salary    488 non-null    float64 
dtypes: category(1), float64(2), object(4)
memory usage: 33.2+ KB


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  nba["Position"] = nba["Position"].astype("category")


## sort_values in df (single)
- On a **DataFrame**, pass the column name through the `by` parameter.
- By default, `NaN` values are placed at the end regardless of sort direction; pass `na_position="first"` to put them at the start.
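
A small sketch of the `NaN` placement rule, using three made-up names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["Zion", np.nan, "Aaron"]})

# Ascending: Aaron, Zion, then NaN at the end
ascending_order = df.sort_values(by="Name")

# Descending: Zion, Aaron, and NaN still at the end
descending_order = df.sort_values(by="Name", ascending=False)

# na_position="first" moves NaN to the top
nan_first = df.sort_values(by="Name", na_position="first")
```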

In [41]:
nba = pd.read_csv("nba.csv")

In [42]:
nba["Name"].sort_values()

122        A.J. Lawson
324           AJ Green
6           AJ Griffin
141       Aaron Gordon
198      Aaron Holiday
            ...       
83         Zach LaVine
149         Zeke Nnaji
291    Ziaire Williams
370    Zion Williamson
591                NaN
Name: Name, Length: 592, dtype: object

In [43]:
nba.sort_values(by="Name",ascending=False)

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
370,Zion Williamson,New Orleans Pelicans,F,6-6,284.0,Duke,34005250.0
291,Ziaire Williams,Memphis Grizzlies,F,6-9,185.0,Stanford,4810200.0
149,Zeke Nnaji,Denver Nuggets,F-C,6-9,240.0,Arizona,4306281.0
83,Zach LaVine,Chicago Bulls,G,6-5,200.0,UCLA,40064220.0
515,Zach Collins,San Antonio Spurs,F-C,6-11,250.0,Gonzaga,7700000.0
...,...,...,...,...,...,...,...
141,Aaron Gordon,Denver Nuggets,F,6-8,235.0,Arizona,22266182.0
6,AJ Griffin,Atlanta Hawks,F,6-6,220.0,Duke,3712920.0
324,AJ Green,Milwaukee Bucks,G,6-5,190.0,Northern Iowa,1901769.0
122,A.J. Lawson,Dallas Mavericks,G,6-6,179.0,South Carolina,


In [44]:
nba.sort_values(by="Name",ascending=False,na_position="first")

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
591,,,,,,,
370,Zion Williamson,New Orleans Pelicans,F,6-6,284.0,Duke,34005250.0
291,Ziaire Williams,Memphis Grizzlies,F,6-9,185.0,Stanford,4810200.0
149,Zeke Nnaji,Denver Nuggets,F-C,6-9,240.0,Arizona,4306281.0
83,Zach LaVine,Chicago Bulls,G,6-5,200.0,UCLA,40064220.0
...,...,...,...,...,...,...,...
198,Aaron Holiday,Houston Rockets,G,6-0,185.0,UCLA,2346614.0
141,Aaron Gordon,Denver Nuggets,F,6-8,235.0,Arizona,22266182.0
6,AJ Griffin,Atlanta Hawks,F,6-6,220.0,Duke,3712920.0
324,AJ Green,Milwaukee Bucks,G,6-5,190.0,Northern Iowa,1901769.0


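## sort_values in df (multi)
Pass a list to `by`; the columns are applied in the order listed, so later columns only break ties in earlier ones. `ascending` also accepts a list with one boolean per column.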
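
A minimal sketch with invented rows: teams are sorted ascending, and names within each team descending.

```python
import pandas as pd

df = pd.DataFrame(
    {"Team": ["B", "A", "A", "B"], "Name": ["x", "y", "z", "w"]}
)

# Team ascending first; Name descending only breaks ties within a team
result = df.sort_values(by=["Team", "Name"], ascending=[True, False])
```

The resulting row order is A/z, A/y, B/x, B/w.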

In [45]:
nba.sort_values(by=["Team","Name"]) 

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
6,AJ Griffin,Atlanta Hawks,F,6-6,220.0,Duke,3712920.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
8,De'Andre Hunter,Atlanta Hawks,F-G,6-8,221.0,Virginia,20089286.0
...,...,...,...,...,...,...,...
578,Taj Gibson,Washington Wizards,F,6-9,232.0,Southern California,
589,Tristan Vukcevic,Washington Wizards,F,6-10,220.0,Real Madrid,
580,Tyus Jones,Washington Wizards,G,6-2,196.0,Duke,14000000.0
573,Xavier Cooks,Washington Wizards,F,6-8,183.0,Winthrop,1719864.0


In [46]:
nba.sort_values(by=["Team","Name"],ascending=[True,False]) 

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
12,Wesley Matthews,Atlanta Hawks,G,6-5,220.0,Marquette,3196448.0
5,Trent Forrest,Atlanta Hawks,G,6-4,210.0,Florida State,508891.0
17,Trae Young,Atlanta Hawks,G,6-1,164.0,Oklahoma,40064220.0
10,Seth Lundy,Atlanta Hawks,G,6-6,220.0,Penn State,559782.0
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
...,...,...,...,...,...,...,...
576,Daniel Gafford,Washington Wizards,F-C,6-10,234.0,Arkansas,12402000.0
581,Corey Kispert,Washington Wizards,F,6-6,224.0,Gonzaga,3722040.0
574,Bilal Coulibaly,Washington Wizards,G,6-6,195.0,Metropolitans 92,6614256.0
579,Anthony Gill,Washington Wizards,F,6-8,230.0,Virginia,1997238.0


## Sort a DataFrame by its Index

In [47]:
nba.sort_index(ascending=False)

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
591,,,,,,,
590,Delon Wright,Washington Wizards,G,6-5,185.0,Utah,8195122.0
589,Tristan Vukcevic,Washington Wizards,F,6-10,220.0,Real Madrid,
588,Landry Shamet,Washington Wizards,G,6-4,190.0,Wichita State,10250000.0
587,Ryan Rollins,Washington Wizards,G,6-3,180.0,Toledo,1719864.0
...,...,...,...,...,...,...,...
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0


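## Ranking method
`rank` returns floats because, by default, tied values share the average of their positions. Remember to convert the ranks back to int, and fill `NaN` values first because they cannot be cast to int.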
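
The default tie handling can produce fractional ranks, which `astype(int)` would truncate. As a sketch on made-up salaries, the `method` parameter of `rank` gives whole-number ranks for ties instead:

```python
import pandas as pd

salary = pd.Series([50, 40, 40, 10])

# Default method="average": the two tied values share rank 2.5
average_rank = salary.rank(ascending=False)

# method="min": ties take the lowest rank of the group, leaving a gap after it
min_rank = salary.rank(ascending=False, method="min")

# method="dense": ties share a rank and no gap follows
dense_rank = salary.rank(ascending=False, method="dense")

# These ranks are whole numbers, so the int conversion loses nothing
min_rank_int = min_rank.astype(int)
```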

In [48]:
nba2 = nba.copy()
nba2["Salary"] = nba2["Salary"].fillna(0).astype(int)

In [49]:
nba2["Salary Rank"] = nba2["Salary"].rank(ascending = False).astype(int)

In [50]:
nba2.sort_values("Salary Rank")

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary,Salary Rank
175,Stephen Curry,Golden State Warriors,G,6-2,185.0,Davidson,51915615,1
461,Kevin Durant,Phoenix Suns,F,6-10,240.0,Texas,47649433,2
261,LeBron James,Los Angeles Lakers,F,6-9,250.0,St. Vincent-St. Mary HS (OH),47607350,4
145,Nikola Jokic,Denver Nuggets,C,6-11,284.0,Mega Basket,47607350,4
436,Joel Embiid,Philadelphia 76ers,C-F,7-0,280.0,Kansas,47607350,4
...,...,...,...,...,...,...,...,...
210,Cam Whitmore,Houston Rockets,F,6-7,230.0,Villanova,0,540
122,A.J. Lawson,Dallas Mavericks,G,6-6,179.0,South Carolina,0,540
123,Dereck Lively II,Dallas Mavericks,C,7-1,230.0,Duke,0,540
204,Jermaine Samuels Jr.,Houston Rockets,F,6-7,230.0,Villanova,0,540


# DataFrames II: filtering data

In [51]:
emp = pd.read_csv("employees.csv")

In [52]:
emp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   First Name         933 non-null    object 
 1   Gender             855 non-null    object 
 2   Start Date         1000 non-null   object 
 3   Last Login Time    1000 non-null   object 
 4   Salary             1000 non-null   int64  
 5   Bonus %            1000 non-null   float64
 6   Senior Management  933 non-null    object 
 7   Team               957 non-null    object 
dtypes: float64(1), int64(1), object(6)
memory usage: 62.6+ KB


In [53]:
emp

# need to process the datetime cols

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,8/6/1993,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,3/31/1996,6:53 AM,61933,4.170,True,
2,Maria,Female,4/23/1993,11:17 AM,130590,11.858,False,Finance
3,Jerry,Male,3/4/2005,1:00 PM,138705,9.340,True,Finance
4,Larry,Male,1/24/1998,4:47 PM,101004,1.389,True,Client Services
...,...,...,...,...,...,...,...,...
995,Henry,,11/23/2014,6:09 AM,132483,16.655,False,Distribution
996,Phillip,Male,1/31/1984,6:30 AM,42392,19.675,False,Finance
997,Russell,Male,5/20/2013,12:39 PM,96914,1.421,False,Product
998,Larry,Male,4/20/2013,4:45 PM,60500,11.985,False,Business Development


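## Declare dtypes: to_datetime, bool, and other type reassignments
- `dt.time` keeps only the time of day (as `datetime.time` objects) and drops the year/month/day.
- Why reassign types? More specific types save memory and enable type-aware comparisons.
- Dates can also be parsed directly in `read_csv()` with `parse_dates`.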
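
Two conversions in this area have pitfalls worth checking on toy data. In the sketch below (made-up values, not `employees.csv`), `%I` is the 12-hour directive that honours `%p`, and the nullable `"boolean"` dtype is one option for keeping missing flags missing instead of letting `astype(bool)` turn them into `True`.

```python
import numpy as np
import pandas as pd

times = pd.Series(["1:00 PM", "6:53 AM", "12:42 PM"])

# %I reads a 12-hour clock, so the AM/PM marker is applied
parsed = pd.to_datetime(times, format="%I:%M %p").dt.time

flags = pd.Series([True, False, np.nan], dtype="object")

# Plain bool has no missing value: NaN is truthy and becomes True
naive = flags.astype(bool)

# The nullable boolean dtype keeps the missing entry as <NA>
nullable = flags.astype("boolean")
```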

In [54]:
emp["Start Date"] = pd.to_datetime(emp["Start Date"],format="%m/%d/%Y")

In [55]:
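# Caveat: %H is the 24-hour directive, so %p is ignored here and PM times keep their 12-hour value
# (1:00 PM is stored as 01:00:00). Use %I instead of %H to read a 12-hour clock correctly.
emp["Last Login Time"] = pd.to_datetime(emp["Last Login Time"],format="%H:%M %p").dt.time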

In [56]:
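# Caveat: astype(bool) turns missing values into True (933 non-null becomes 1000 non-null in info() below)
emp["Senior Management"] = emp["Senior Management"].astype(bool)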

In [57]:
emp["Gender"] = emp["Gender"].astype("category")

In [58]:
emp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   First Name         933 non-null    object        
 1   Gender             855 non-null    category      
 2   Start Date         1000 non-null   datetime64[ns]
 3   Last Login Time    1000 non-null   object        
 4   Salary             1000 non-null   int64         
 5   Bonus %            1000 non-null   float64       
 6   Senior Management  1000 non-null   bool          
 7   Team               957 non-null    object        
dtypes: bool(1), category(1), datetime64[ns](1), float64(1), int64(1), object(3)
memory usage: 49.1+ KB


In [59]:
# in summary:

emp = pd.read_csv("employees.csv",parse_dates=["Start Date"],date_format="%m/%d/%Y")
emp["Last Login Time"] = pd.to_datetime(emp["Last Login Time"],format="%H:%M %p").dt.time
emp["Senior Management"] = emp["Senior Management"].astype(bool)
emp["Gender"] = emp["Gender"].astype("category")
emp.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,01:00:00,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,04:47:00,101004,1.389,True,Client Services


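## filtering (single)
- Pass a boolean **Series** inside the square brackets of the **DataFrame**; only rows where the value is `True` are kept.
- Filtering on date or time columns sometimes requires the `datetime` module (imported below as `dt`).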
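
A minimal sketch of the mechanism on invented rows: the comparison builds a boolean **Series**, and indexing with it keeps the `True` rows.

```python
import pandas as pd

df = pd.DataFrame(
    {"Gender": ["Male", "Female", "Male"], "Salary": [1, 2, 3]}
)

# One True/False per row, aligned on the DataFrame's index
mask = df["Gender"] == "Male"

# Rows where the mask is True are kept, with their original index labels
males = df[mask]
```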

In [60]:
# emp["Gender"] == "Male" evaluates to a boolean Series with one True/False per row

emp[emp["Gender"] == "Male"]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,06:53:00,61933,4.170,True,
3,Jerry,Male,2005-03-04,01:00:00,138705,9.340,True,Finance
4,Larry,Male,1998-01-24,04:47:00,101004,1.389,True,Client Services
5,Dennis,Male,1987-04-18,01:35:00,115163,10.125,False,Legal
...,...,...,...,...,...,...,...,...
994,George,Male,2013-06-21,05:47:00,98874,4.479,True,Marketing
996,Phillip,Male,1984-01-31,06:30:00,42392,19.675,False,Finance
997,Russell,Male,2013-05-20,12:39:00,96914,1.421,False,Product
998,Larry,Male,2013-04-20,04:45:00,60500,11.985,False,Business Development


In [61]:
emp[emp["Team"] == "Finance"]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
2,Maria,Female,1993-04-23,11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,01:00:00,138705,9.340,True,Finance
7,,Female,2015-07-20,10:43:00,45906,11.598,True,Finance
14,Kimberly,Female,1999-01-14,07:13:00,41426,14.543,True,Finance
46,Bruce,Male,2009-11-28,10:47:00,114796,6.796,False,Finance
...,...,...,...,...,...,...,...,...
907,Elizabeth,Female,1998-07-27,11:12:00,137144,10.081,False,Finance
954,Joe,Male,1980-01-19,04:06:00,119667,1.148,True,Finance
987,Gloria,Female,2014-12-08,05:08:00,136709,10.331,True,Finance
992,Anthony,Male,2011-10-16,08:35:00,112769,11.625,True,Finance


In [62]:
emp[emp["Bonus %"] < 1.2]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
217,Douglas,Male,1999-09-03,04:00:00,83341,1.015,True,Client Services
273,Nicholas,Male,1994-04-12,08:21:00,74669,1.113,True,Product
279,Ruby,Female,2000-11-08,07:35:00,105946,1.139,False,Business Development
365,Gloria,,1983-07-19,01:57:00,140885,1.113,False,Human Resources
481,,Female,2013-04-27,06:40:00,93847,1.085,True,Business Development
527,Helen,,1993-12-02,01:42:00,45724,1.022,False,Product
579,Harold,Male,2010-10-18,08:45:00,65673,1.187,True,Legal
652,Willie,Male,2009-12-05,05:39:00,141932,1.017,True,Engineering
708,Steve,Male,2002-01-11,09:17:00,51821,1.197,True,Legal
746,Gloria,Female,2004-08-19,10:31:00,46602,1.027,True,Business Development


In [63]:
emp[emp["Start Date"] < "1981/01/01"]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
10,Louise,Female,1980-08-12,09:01:00,63241,15.132,True,
12,Brandon,Male,1980-12-01,01:08:00,112807,17.492,True,Human Resources
43,Marilyn,Female,1980-12-07,03:16:00,73524,5.207,True,Marketing
45,Roger,Male,1980-04-17,11:32:00,88010,13.886,True,Sales
49,Chris,,1980-01-24,12:13:00,113590,3.055,False,Sales
82,Steven,Male,1980-03-30,09:20:00,35095,8.379,True,Client Services
154,Rebecca,Female,1980-11-15,04:13:00,85730,5.359,True,Product
213,Evelyn,Female,1980-05-24,11:10:00,81673,15.364,True,Engineering
272,Fred,Male,1980-02-20,02:25:00,74129,18.225,False,Product
303,Joan,,1980-07-25,12:22:00,38712,3.657,False,Client Services


In [64]:
import datetime as dt

# dt.time(12, 0, 0) represents noon; keep the rows with an earlier login time
emp[emp["Last Login Time"] < dt.time(12,0,0)]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
1,Thomas,Male,1996-03-31,06:53:00,61933,4.170,True,
2,Maria,Female,1993-04-23,11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,01:00:00,138705,9.340,True,Finance
4,Larry,Male,1998-01-24,04:47:00,101004,1.389,True,Client Services
5,Dennis,Male,1987-04-18,01:35:00,115163,10.125,False,Legal
...,...,...,...,...,...,...,...,...
994,George,Male,2013-06-21,05:47:00,98874,4.479,True,Marketing
995,Henry,,2014-11-23,06:09:00,132483,16.655,False,Distribution
996,Phillip,Male,1984-01-31,06:30:00,42392,19.675,False,Finance
998,Larry,Male,2013-04-20,04:45:00,60500,11.985,False,Business Development


## filtering (multiple): &

In [65]:
is_female = emp["Gender"]=="Female" 
is_marketing = emp["Team"]=="Marketing"

emp[is_female & is_marketing].iloc[0:10]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
43,Marilyn,Female,1980-12-07,03:16:00,73524,5.207,True,Marketing
62,,Female,2007-06-12,05:25:00,58112,19.414,True,Marketing
98,Tina,Female,2016-06-16,07:47:00,100705,16.961,True,Marketing
140,Shirley,Female,1981-02-28,01:23:00,113850,1.854,False,Marketing
158,Norma,Female,1999-02-28,08:45:00,114412,8.756,True,Marketing
201,Kimberly,Female,1997-07-15,05:57:00,36643,7.953,False,Marketing
220,,Female,1991-06-17,12:49:00,71945,5.56,True,Marketing
305,Margaret,Female,1993-02-06,01:05:00,125220,3.733,False,Marketing
319,Jacqueline,Female,1981-11-25,03:01:00,145988,18.243,False,Marketing
331,Evelyn,Female,1983-09-03,01:58:00,36759,17.269,True,Marketing


## filtering (multiple): |

In [66]:
is_senior_management = emp["Senior Management"]
started_in_80s = emp["Start Date"] < "1990-01-01"

emp[is_senior_management | started_in_80s]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,06:53:00,61933,4.170,True,
3,Jerry,Male,2005-03-04,01:00:00,138705,9.340,True,Finance
4,Larry,Male,1998-01-24,04:47:00,101004,1.389,True,Client Services
5,Dennis,Male,1987-04-18,01:35:00,115163,10.125,False,Legal
...,...,...,...,...,...,...,...,...
992,Anthony,Male,2011-10-16,08:35:00,112769,11.625,True,Finance
993,Tina,Female,1997-05-15,03:53:00,56450,19.040,True,Engineering
994,George,Male,2013-06-21,05:47:00,98874,4.479,True,Marketing
996,Phillip,Male,1984-01-31,06:30:00,42392,19.675,False,Finance
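
The `&` and `|` sections above can be summarised with a minimal sketch. The `staff` table below is a made-up stand-in for `emp`, so the values are assumptions rather than data from the notebook.

```python
import pandas as pd

# Hypothetical miniature of the employees table (not the real data).
staff = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male"],
    "Team": ["Marketing", "Marketing", "Finance", "Legal"],
    "Senior Management": [True, False, False, True],
})

is_female = staff["Gender"] == "Female"
is_marketing = staff["Team"] == "Marketing"
is_senior = staff["Senior Management"]

both = staff[is_female & is_marketing]   # AND: every condition must hold
either = staff[is_female | is_senior]    # OR: at least one condition must hold

# When the comparisons are written inline, wrap each one in parentheses,
# because & and | bind more tightly than ==.
inline = staff[(staff["Gender"] == "Female") & (staff["Team"] == "Marketing")]
```

Naming each condition first, as the notebook does, avoids the parentheses problem and keeps the filter readable.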


## filtering: isin() to match any of several values

In [67]:
target_teams = emp["Team"].isin(["Legal", "Sales", "Product"])
emp[target_teams]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
5,Dennis,Male,1987-04-18,01:35:00,115163,10.125,False,Legal
6,Ruby,Female,1987-08-17,04:20:00,65476,10.012,True,Product
11,Julie,Female,1997-10-26,03:19:00,102508,12.637,True,Legal
13,Gary,Male,2008-01-27,11:40:00,109831,5.831,False,Sales
15,Lillian,Female,2016-06-05,06:09:00,59414,1.256,False,Product
...,...,...,...,...,...,...,...,...
981,James,Male,1993-01-15,05:19:00,148985,19.280,False,Legal
985,Stephen,,1983-07-10,08:10:00,85668,1.909,False,Legal
989,Justin,,1991-02-10,04:58:00,38344,3.794,False,Legal
997,Russell,Male,2013-05-20,12:39:00,96914,1.421,False,Product
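
As a rough illustration of why `isin` is convenient, the sketch below compares it with the equivalent chain of `|` conditions. The `teams` Series is invented for the example.

```python
import pandas as pd

# Hypothetical team column with one missing value.
teams = pd.Series(["Legal", "Sales", "Finance", None, "Product"])

# One call replaces three separate equality checks joined with |.
with_isin = teams.isin(["Legal", "Sales", "Product"])
with_or = (teams == "Legal") | (teams == "Sales") | (teams == "Product")
```

Both masks select the same rows; the missing value is treated as a non-match in each case.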


## filtering: isnull() and notnull()

In [68]:
emp[emp["Team"].isnull()]

emp[emp["Team"].notnull()]

emp[emp["First Name"].isnull() & emp["Team"].notnull()].iloc[0:3]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
7,,Female,2015-07-20,10:43:00,45906,11.598,True,Finance
25,,Male,2012-10-08,01:12:00,37076,18.576,True,Client Services
39,,Male,2016-01-29,02:33:00,122173,7.797,True,Client Services


## filtering: between()

In [69]:
emp[emp["Salary"].between(60000, 70000)]

emp[emp["Bonus %"].between(2.0, 5.0)]

emp[emp["Start Date"].between("1991-01-01", "1992-01-01")]

emp[emp["Last Login Time"].between(dt.time(8, 30), dt.time(12, 0))]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
2,Maria,Female,1993-04-23,11:17:00,130590,11.858,False,Finance
7,,Female,2015-07-20,10:43:00,45906,11.598,True,Finance
10,Louise,Female,1980-08-12,09:01:00,63241,15.132,True,
13,Gary,Male,2008-01-27,11:40:00,109831,5.831,False,Sales
18,Diana,Female,1981-10-23,10:27:00,132940,19.082,False,Client Services
...,...,...,...,...,...,...,...,...
977,Sarah,Female,1995-12-04,09:16:00,124566,5.949,False,Product
982,Rose,Female,1982-04-06,10:43:00,91411,8.639,True,Human Resources
983,John,Male,1982-12-23,10:35:00,146907,11.738,False,Engineering
988,Alice,Female,2004-10-05,09:34:00,47638,11.209,False,Human Resources
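
One detail worth checking with `between` is how it treats the endpoints. The sketch below uses a made-up `salary` Series and assumes pandas 1.3 or later, where the `inclusive` parameter accepts strings such as `"left"`.

```python
import pandas as pd

# Hypothetical salaries placed around the two boundaries.
salary = pd.Series([59999, 60000, 65000, 70000, 70001])

# By default both endpoints are included in the range.
closed = salary.between(60000, 70000)

# inclusive="left" keeps the lower bound and excludes the upper one.
half_open = salary.between(60000, 70000, inclusive="left")
```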


## The duplicated Method: duplicated(boolean), drop_duplicates
- The `duplicated` method returns a Boolean **Series** that marks repeated values with `True`.
- Pandas will mark one occurrence of a repeated value as a non-duplicate.
- Use the `keep` parameter to designate whether the first or last occurrence of a repeated value should be considered the "non-duplicate".
- Pass `False` to the `keep` parameter to mark all occurrences of repeated values as duplicates.
- Use the tilde symbol (`~`) to invert a Boolean **Series**. `True` values become `False`, and `False` values become `True`.
- The `drop_duplicates` method removes duplicate rows. Pass it a column name, or a list of column names, to compare only those columns.

In [70]:
emp["First Name"].duplicated()

0      False
1      False
2      False
3      False
4      False
       ...  
995     True
996     True
997     True
998     True
999     True
Name: First Name, Length: 1000, dtype: bool

In [71]:
emp.drop_duplicates()

emp.drop_duplicates("Team")
emp.drop_duplicates("Team", keep="first")
emp.drop_duplicates("Team", keep="last")
emp.drop_duplicates("Team", keep=False)

emp.drop_duplicates("First Name", keep=False)

emp.drop_duplicates(["Senior Management", "Team"]).sort_values("Team")

emp.drop_duplicates(["Senior Management", "Team"], keep="last").sort_values("Team")

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
971,Patrick,Male,2002-12-30,02:01:00,75423,5.368,True,Business Development
998,Larry,Male,2013-04-20,04:45:00,60500,11.985,False,Business Development
965,Catherine,Female,1989-09-25,01:31:00,68164,18.393,False,Client Services
990,Robin,Female,1987-07-24,01:35:00,100765,10.982,True,Client Services
946,,Female,1985-09-15,01:50:00,133472,16.941,True,Distribution
995,Henry,,2014-11-23,06:09:00,132483,16.655,False,Distribution
993,Tina,Female,1997-05-15,03:53:00,56450,19.04,True,Engineering
984,Maria,Female,2011-10-15,04:53:00,43455,13.04,False,Engineering
996,Phillip,Male,1984-01-31,06:30:00,42392,19.675,False,Finance
992,Anthony,Male,2011-10-16,08:35:00,112769,11.625,True,Finance


In [72]:
emp[~emp["First Name"].duplicated(keep=False)]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
5,Dennis,Male,1987-04-18,01:35:00,115163,10.125,False,Legal
8,Angela,Female,2005-11-22,06:29:00,95570,18.523,True,Engineering
33,Jean,Female,1993-12-18,09:07:00,119082,16.18,False,Business Development
190,Carol,Female,1996-03-19,03:39:00,57783,9.129,False,Finance
291,Tammy,Female,1984-11-11,10:30:00,132839,17.463,True,Client Services
495,Eugene,Male,1984-05-24,10:54:00,81077,2.117,False,Sales
688,Brian,Male,2007-04-07,10:47:00,93901,17.821,True,Legal
832,Keith,Male,2003-02-12,03:02:00,120672,19.467,False,Legal
887,David,Male,2009-12-05,08:48:00,92242,15.407,False,Legal
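
The effect of each `keep` option is easier to see on a tiny example. This is a sketch on invented data (`names` and `pairs` are hypothetical), not on the `emp` table.

```python
import pandas as pd

names = pd.Series(["Ann", "Bob", "Ann", "Cy", "Bob", "Ann"])

first_kept = names.duplicated()             # keep="first" is the default
last_kept = names.duplicated(keep="last")   # the last occurrence is the non-duplicate
all_marked = names.duplicated(keep=False)   # every repeated value is marked

# Inverting the keep=False mask leaves only values that occur exactly once.
only_once = names[~all_marked]

# drop_duplicates with a list compares the combination of those columns.
pairs = pd.DataFrame({
    "Team": ["A", "A", "B", "B"],
    "Senior": [True, True, False, True],
})
distinct_pairs = pairs.drop_duplicates(["Team", "Senior"])
```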


## The unique and nunique Methods

- The `unique` method on a **Series** returns an array of its unique values. The method does not exist on a **DataFrame**.
- The `nunique` method returns a count of the unique values in a **Series**, or one count per column for a **DataFrame**.
- The `dropna` parameter controls whether missing (`NaN`) values are counted. It defaults to `True`, which excludes them.

In [73]:
emp.nunique()

First Name           200
Gender                 2
Start Date           972
Last Login Time      542
Salary               995
Bonus %              971
Senior Management      2
Team                  10
dtype: int64

In [74]:
emp["Team"].unique()

array(['Marketing', nan, 'Finance', 'Client Services', 'Legal', 'Product',
       'Engineering', 'Business Development', 'Human Resources', 'Sales',
       'Distribution'], dtype=object)

In [75]:
emp["Team"].nunique(dropna=False)

11
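
The difference between the 10 teams reported by `emp.nunique()` and the 11 above comes from the missing value. A small sketch with an invented `team` Series shows the same effect.

```python
import pandas as pd

# Hypothetical column with one missing entry.
team = pd.Series(["Marketing", None, "Finance", "Marketing"])

values = team.unique()                         # the missing value is included
without_missing = team.nunique()               # dropna=True by default
with_missing = team.nunique(dropna=False)      # counts the missing value too
```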

# DataFrames III: loc&iloc, writing and deleting

In [76]:
bond = pd.read_csv("jamesbond.csv")

## set_index, reset_index

- The `set_index` method sets an existing column as the index of the **DataFrame**.
- The `reset_index` method restores the standard ascending numeric index and, by default, moves the previous index back into a column.

In [77]:
bond = bond.set_index("Film")
bond.head()

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [78]:
# back to normal
bond = bond.reset_index()
bond.head()

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
4,Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [79]:
bond = bond.reset_index().set_index("Year")
bond.head()

Unnamed: 0_level_0,index,Film,Actor,Director,Box Office,Budget,Bond Actor Salary
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1962,0,Dr. No,Sean Connery,Terence Young,448.8,7.0,0.6
1963,1,From Russia with Love,Sean Connery,Terence Young,543.8,12.6,1.6
1964,2,Goldfinger,Sean Connery,Guy Hamilton,820.4,18.6,3.2
1965,3,Thunderball,Sean Connery,Terence Young,848.1,41.9,4.7
1967,4,Casino Royale,David Niven,Ken Hughes,315.0,85.0,
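
The extra `index` column in the output above appears because `reset_index` was called on a **DataFrame** that already had the default numeric index. The sketch below reproduces this on a made-up two-row table (`films` is hypothetical) and shows the `drop=True` option, which discards the old index instead of keeping it.

```python
import pandas as pd

films = pd.DataFrame({"Film": ["Dr. No", "Goldfinger"], "Year": [1962, 1964]})

by_film = films.set_index("Film")       # "Film" becomes the index
restored = by_film.reset_index()        # "Film" returns as a regular column

# Resetting a default numeric index stores it in a new "index" column.
extra = films.reset_index()

# drop=True throws the old index away rather than turning it into a column.
dropped = by_film.reset_index(drop=True)
```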


## iloc Accessor
- The `iloc` accessor retrieves one or more rows by index position.
- Provide a pair of square brackets after the accessor.
- `iloc` accepts single values, lists, and slices.

In [80]:
bond.reset_index().iloc[[15, 20]]

Unnamed: 0,Year,index,Film,Actor,Director,Box Office,Budget,Bond Actor Salary
15,1985,15,A View to a Kill,Roger Moore,John Glen,275.2,54.5,9.1
20,1999,20,The World Is Not Enough,Pierce Brosnan,Michael Apted,439.5,158.3,13.5


In [81]:
bond = bond.reset_index()
bond.iloc[0:6]

Unnamed: 0,Year,index,Film,Actor,Director,Box Office,Budget,Bond Actor Salary
0,1962,0,Dr. No,Sean Connery,Terence Young,448.8,7.0,0.6
1,1963,1,From Russia with Love,Sean Connery,Terence Young,543.8,12.6,1.6
2,1964,2,Goldfinger,Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,1965,3,Thunderball,Sean Connery,Terence Young,848.1,41.9,4.7
4,1967,4,Casino Royale,David Niven,Ken Hughes,315.0,85.0,
5,1967,5,You Only Live Twice,Sean Connery,Lewis Gilbert,514.2,59.9,4.4


In [82]:
bond.iloc[20:]

Unnamed: 0,Year,index,Film,Actor,Director,Box Office,Budget,Bond Actor Salary
20,1999,20,The World Is Not Enough,Pierce Brosnan,Michael Apted,439.5,158.3,13.5
21,2002,21,Die Another Day,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
22,2006,22,Casino Royale,Daniel Craig,Martin Campbell,581.5,145.3,3.3
23,2008,23,Quantum of Solace,Daniel Craig,Marc Forster,514.2,181.4,8.1
24,2012,24,Skyfall,Daniel Craig,Sam Mendes,943.5,170.2,14.5
25,2015,25,Spectre,Daniel Craig,Sam Mendes,726.7,206.3,30.0
26,2021,26,No Time to Die,Daniel Craig,Cary Joji Fukunaga,774.2,301.0,25.0
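
The three kinds of input that `iloc` accepts can be sketched on a small invented table (`letters` is hypothetical):

```python
import pandas as pd

letters = pd.DataFrame({"Film": ["A", "B", "C", "D", "E"]})

single = letters.iloc[2]        # one position returns a Series for that row
picked = letters.iloc[[0, 3]]   # a list returns a DataFrame with those rows
window = letters.iloc[1:3]      # a slice excludes its end position
last = letters.iloc[-1]         # negative positions count from the end
```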


## loc
- The `loc` accessor retrieves one or more rows by index label.
- Provide a pair of square brackets after the accessor.
- `loc` returns every matching row when an index label is duplicated.

In [83]:
bond = pd.read_csv("jamesbond.csv", index_col="Film")

bond.loc["Goldfinger"]
# bond.loc["Sacred Bond"] raises a KeyError because the label is not in the index

Year                         1964
Actor                Sean Connery
Director             Guy Hamilton
Box Office                  820.4
Budget                       18.6
Bond Actor Salary             3.2
Name: Goldfinger, dtype: object

In [84]:
# all rows for matching
bond.loc["Casino Royale"]
# bond.loc[:"Casino Royale"] raises a KeyError here: the label is duplicated and the index is not sorted

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3


In [85]:
bond.loc[["Octopussy", "Moonraker"]]

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Octopussy,1983,Roger Moore,John Glen,373.8,53.9,7.8
Moonraker,1979,Roger Moore,Lewis Gilbert,535.0,91.5,
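
How the return type of `loc` depends on the label can be sketched as follows. The three-row `films` table is invented for the example.

```python
import pandas as pd

films = pd.DataFrame(
    {"Year": [1962, 1967, 2006]},
    index=["Dr. No", "Casino Royale", "Casino Royale"],
)

one = films.loc["Dr. No"]           # unique label: a Series
many = films.loc["Casino Royale"]   # duplicated label: a DataFrame with every match

# A label that is not in the index raises a KeyError.
try:
    films.loc["Sacred Bond"]
    found = True
except KeyError:
    found = False
```

Checking `"Sacred Bond" in films.index` beforehand is one way to avoid the exception.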


## Second Arguments to loc and iloc Accessors
- The second value inside the square brackets targets the columns.
- `iloc` requires numeric positions for rows and columns.
- `loc` requires labels for rows and columns. Unlike an `iloc` slice, a `loc` slice includes its end label.

In [86]:
bond = pd.read_csv("jamesbond.csv", index_col="Film").sort_index()

bond.loc["GoldenEye":"Octopussy", "Director":"Budget"]

Unnamed: 0_level_0,Director,Box Office,Budget
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
GoldenEye,Martin Campbell,518.5,76.9
Goldfinger,Guy Hamilton,820.4,18.6
Licence to Kill,John Glen,250.9,56.7
Live and Let Die,Guy Hamilton,460.3,30.8
Moonraker,Lewis Gilbert,535.0,91.5
Never Say Never Again,Irvin Kershner,380.0,86.0
No Time to Die,Cary Joji Fukunaga,774.2,301.0
Octopussy,John Glen,373.8,53.9


In [87]:
bond.iloc[[0, 2], [3, 5]]

Unnamed: 0_level_0,Box Office,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1
A View to a Kill,275.2,9.1
Casino Royale,315.0,


In [88]:
bond.iloc[:7, :3]

Unnamed: 0_level_0,Year,Actor,Director
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
A View to a Kill,1985,Roger Moore,John Glen
Casino Royale,2006,Daniel Craig,Martin Campbell
Casino Royale,1967,David Niven,Ken Hughes
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton
Die Another Day,2002,Pierce Brosnan,Lee Tamahori
Dr. No,1962,Sean Connery,Terence Young
For Your Eyes Only,1981,Roger Moore,John Glen
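
The row and column arguments, and the different slice behaviour of the two accessors, can be compared side by side. This is a sketch on a made-up three-film table.

```python
import pandas as pd

films = pd.DataFrame(
    {
        "Year": [1962, 1963, 1964],
        "Actor": ["Sean Connery"] * 3,
        "Budget": [7.0, 12.6, 18.6],
    },
    index=["Dr. No", "From Russia with Love", "Goldfinger"],
)

# loc slices include both end labels, for rows and for columns.
by_label = films.loc["Dr. No":"From Russia with Love", "Year":"Actor"]

# iloc slices exclude the end position, so 0:2 means positions 0 and 1.
by_position = films.iloc[0:2, 0:2]

# A single row and a single column return one cell value.
cell_by_label = films.loc["Goldfinger", "Budget"]
cell_by_position = films.iloc[2, 2]
```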


## Important: iloc - index number; loc - index label; 1st arg: rows; 2nd arg: columns

In [89]:
#overwrite with loc

bond.loc["Diamonds Are Forever", "Actor"] = "Sir Sean Connery"
bond.head()

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
Diamonds Are Forever,1971,Sir Sean Connery,Guy Hamilton,442.5,34.7,5.8
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9


##  Overwrite Multiple Values in a DataFrame
- The `replace` method replaces all occurrences of a **Series** value with another value (think of it like "Find and Replace").
- To overwrite multiple values in a **DataFrame**, remember to use an accessor on the **DataFrame** itself.
- Accessors like `loc` and `iloc` can accept Boolean Series. Use them to target the values to overwrite.

In [90]:
#replace all cases of Sean Connery

bond["Actor"] = bond["Actor"].replace("Sean Connery", "Sir Sean Connery")
bond

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
Diamonds Are Forever,1971,Sir Sean Connery,Guy Hamilton,442.5,34.7,5.8
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
Dr. No,1962,Sir Sean Connery,Terence Young,448.8,7.0,0.6
For Your Eyes Only,1981,Roger Moore,John Glen,449.4,60.2,
From Russia with Love,1963,Sir Sean Connery,Terence Young,543.8,12.6,1.6
GoldenEye,1995,Pierce Brosnan,Martin Campbell,518.5,76.9,5.1
Goldfinger,1964,Sir Sean Connery,Guy Hamilton,820.4,18.6,3.2
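
The bullet about passing a Boolean Series to an accessor can be sketched as an alternative to `replace`. The `films` table below is invented; only the technique mirrors the notebook.

```python
import pandas as pd

films = pd.DataFrame(
    {
        "Actor": ["Sean Connery", "Roger Moore", "Sean Connery"],
        "Budget": [7.0, 30.8, 18.6],
    },
    index=["Dr. No", "Live and Let Die", "Goldfinger"],
)

is_connery = films["Actor"] == "Sean Connery"

# First argument: the Boolean Series selects rows. Second argument: the column.
films.loc[is_connery, "Actor"] = "Sir Sean Connery"
```

Writing through `loc` on the **DataFrame** itself matters here. A chained form such as `films[is_connery]["Actor"] = ...` may assign to a temporary copy and leave `films` unchanged.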


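## rename columns and index labels
- Pass the `columns` (or `index`) parameter a dictionary of `{old_name: new_name}`.
- The `rename` method returns a new **DataFrame**; assign the result to keep the change.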

In [91]:
bond.rename(columns={ "Year": "Year of Release", "Box Office": "Revenue" })

Unnamed: 0_level_0,Year of Release,Actor,Director,Revenue,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
Diamonds Are Forever,1971,Sir Sean Connery,Guy Hamilton,442.5,34.7,5.8
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
Dr. No,1962,Sir Sean Connery,Terence Young,448.8,7.0,0.6
For Your Eyes Only,1981,Roger Moore,John Glen,449.4,60.2,
From Russia with Love,1963,Sir Sean Connery,Terence Young,543.8,12.6,1.6
GoldenEye,1995,Pierce Brosnan,Martin Campbell,518.5,76.9,5.1
Goldfinger,1964,Sir Sean Connery,Guy Hamilton,820.4,18.6,3.2
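
A minimal sketch of `rename` on a made-up table, showing that the same dictionary form works for columns and for index labels, and that the original is left untouched:

```python
import pandas as pd

films = pd.DataFrame({"Year": [1962], "Box Office": [448.8]}, index=["Dr. No"])

renamed_columns = films.rename(
    columns={"Year": "Year of Release", "Box Office": "Revenue"}
)
renamed_rows = films.rename(index={"Dr. No": "Doctor No"})
```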


## drop values

- The `drop` method deletes one or more rows/columns from a **DataFrame**.
- Pass the `index` parameter a list of row labels to remove, and the `columns` parameter a list of column names to remove.
- The `pop` method removes a single column and returns it as a **Series** (it mutates the **DataFrame** in the process).
- Python's `del` keyword also removes a single column in place.

In [92]:
bond.drop(index=["No Time to Die", "Casino Royale"], columns=["Box Office", "Budget"])

Unnamed: 0_level_0,Year,Actor,Director,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
A View to a Kill,1985,Roger Moore,John Glen,9.1
Diamonds Are Forever,1971,Sir Sean Connery,Guy Hamilton,5.8
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,17.9
Dr. No,1962,Sir Sean Connery,Terence Young,0.6
For Your Eyes Only,1981,Roger Moore,John Glen,
From Russia with Love,1963,Sir Sean Connery,Terence Young,1.6
GoldenEye,1995,Pierce Brosnan,Martin Campbell,5.1
Goldfinger,1964,Sir Sean Connery,Guy Hamilton,3.2
Licence to Kill,1989,Timothy Dalton,John Glen,7.9
Live and Let Die,1973,Roger Moore,Guy Hamilton,


In [93]:
# delete col
del bond["Year"]

In [94]:
actor = bond.pop("Actor")
actor.head()

Film
A View to a Kill             Roger Moore
Casino Royale               Daniel Craig
Casino Royale                David Niven
Diamonds Are Forever    Sir Sean Connery
Die Another Day           Pierce Brosnan
Name: Actor, dtype: object
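
The three removal tools differ in whether they change the original **DataFrame**. The sketch below uses an invented `films` table to show the difference.

```python
import pandas as pd

films = pd.DataFrame(
    {
        "Year": [1962, 1964],
        "Actor": ["Sean Connery", "Sean Connery"],
        "Budget": [7.0, 18.6],
    },
    index=["Dr. No", "Goldfinger"],
)

# drop returns a new DataFrame and leaves films as it was.
trimmed = films.drop(index=["Dr. No"], columns=["Budget"])
columns_after_drop = list(films.columns)

# pop removes the column from films and hands it back as a Series.
actor = films.pop("Actor")

# del removes the column from films and returns nothing.
del films["Year"]
```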

## random sampling: axis & random_state

In [95]:
bond.sample()
bond.sample(n=5)

bond.sample(n=2, axis="columns")

Unnamed: 0_level_0,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1
A View to a Kill,54.5,9.1
Casino Royale,145.3,3.3
Casino Royale,85.0,
Diamonds Are Forever,34.7,5.8
Die Another Day,154.2,17.9
Dr. No,7.0,0.6
For Your Eyes Only,60.2,
From Russia with Love,12.6,1.6
GoldenEye,76.9,5.1
Goldfinger,18.6,3.2


In [96]:
bond.sample(n=3, axis="rows")

Unnamed: 0_level_0,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Casino Royale,Martin Campbell,581.5,145.3,3.3
Live and Let Die,Guy Hamilton,460.3,30.8,
Octopussy,John Glen,373.8,53.9,7.8


In [97]:
bond.sample(n=5) #default is row

Unnamed: 0_level_0,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Dr. No,Terence Young,448.8,7.0,0.6
Diamonds Are Forever,Guy Hamilton,442.5,34.7,5.8
Octopussy,John Glen,373.8,53.9,7.8
You Only Live Twice,Lewis Gilbert,514.2,59.9,4.4
Casino Royale,Martin Campbell,581.5,145.3,3.3


In [98]:
bond.sample(n=3, random_state=22)

Unnamed: 0_level_0,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
The World Is Not Enough,Michael Apted,439.5,158.3,13.5
You Only Live Twice,Lewis Gilbert,514.2,59.9,4.4
Thunderball,Terence Young,848.1,41.9,4.7
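
The role of `random_state` is to make a sample repeatable. A minimal sketch on a made-up numeric table (`numbers` is hypothetical) illustrates this, along with the `frac` and `axis` parameters.

```python
import pandas as pd

numbers = pd.DataFrame({
    "a": range(10),
    "b": range(10, 20),
    "c": range(20, 30),
})

# The same seed gives the same rows every time.
first = numbers.sample(n=3, random_state=22)
second = numbers.sample(n=3, random_state=22)

# frac samples a proportion of the rows instead of a fixed count.
half = numbers.sample(frac=0.5, random_state=1)

# axis="columns" samples columns and keeps every row.
two_columns = numbers.sample(n=2, axis="columns", random_state=0)
```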


## The nsmallest and nlargest Methods: faster than sort_values() at scale; ONLY for numeric
- The `nlargest` method returns a specified number of rows with the largest values from a given column.
- The `nsmallest` method returns rows with the smallest values from a given column.
- The `nlargest` and `nsmallest` methods are more efficient than sorting the entire **DataFrame**.

In [99]:
# Retrieve the 4 films with the highest/largest Box Office gross
bond.sort_values("Box Office", ascending=False).head(4)

Unnamed: 0_level_0,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Skyfall,Sam Mendes,943.5,170.2,14.5
Thunderball,Terence Young,848.1,41.9,4.7
Goldfinger,Guy Hamilton,820.4,18.6,3.2
No Time to Die,Cary Joji Fukunaga,774.2,301.0,25.0


In [100]:
bond.nlargest(n=4, columns="Box Office")
bond["Box Office"].nlargest(4)

Film
Skyfall           943.5
Thunderball       848.1
Goldfinger        820.4
No Time to Die    774.2
Name: Box Office, dtype: float64

In [101]:
bond.nsmallest(3, columns="Bond Actor Salary")
bond["Bond Actor Salary"].nsmallest(3)

Film
Dr. No                             0.6
On Her Majesty's Secret Service    0.6
From Russia with Love              1.6
Name: Bond Actor Salary, dtype: float64
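
The equivalence between `nlargest` and sorting, and the handling of ties, can be sketched on invented figures. The `gross` and `salary` objects are hypothetical.

```python
import pandas as pd

gross = pd.DataFrame(
    {"Box Office": [250.9, 943.5, 820.4, 848.1]},
    index=["Film A", "Film B", "Film C", "Film D"],
)

# With no ties, nlargest matches sorting in descending order and taking the head.
top_two = gross.nlargest(2, "Box Office")
sorted_two = gross.sort_values("Box Office", ascending=False).head(2)

# keep="all" returns every row tied for the last place, even beyond n.
salary = pd.Series([0.6, 0.6, 1.6, 3.2])
lowest_with_ties = salary.nsmallest(1, keep="all")
```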

## Filtering with where: unlike loc or df[mask], it keeps non-matching rows and fills them with NaN

In [102]:
bond = pd.read_csv("jamesbond.csv", index_col="Film").sort_index()

In [103]:
actor_is_sean_connery = bond["Actor"] == "Sean Connery"
#bond[actor_is_sean_connery]
#bond.loc[actor_is_sean_connery]
bond.where(actor_is_sean_connery)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,,,,,,
Casino Royale,,,,,,
Casino Royale,,,,,,
Diamonds Are Forever,1971.0,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Die Another Day,,,,,,
Dr. No,1962.0,Sean Connery,Terence Young,448.8,7.0,0.6
For Your Eyes Only,,,,,,
From Russia with Love,1963.0,Sean Connery,Terence Young,543.8,12.6,1.6
GoldenEye,,,,,,
Goldfinger,1964.0,Sean Connery,Guy Hamilton,820.4,18.6,3.2
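
The contrast between Boolean indexing and `where` can be sketched on a made-up two-row table. Note that integer columns become floats once `NaN` is introduced, which is why the years above print as `1971.0`.

```python
import pandas as pd

films = pd.DataFrame(
    {"Year": [1962, 1973], "Actor": ["Sean Connery", "Roger Moore"]},
    index=["Dr. No", "Live and Let Die"],
)

is_connery = films["Actor"] == "Sean Connery"

filtered = films[is_connery]       # non-matching rows are removed
masked = films.where(is_connery)   # same shape; non-matching rows become NaN

# Dropping the all-NaN rows recovers the same set of rows as filtering.
recovered = masked.dropna(how="all")
```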


## Apply in df

In [104]:
bond["Actor"].apply(len)

Film
A View to a Kill                   11
Casino Royale                      12
Casino Royale                      11
Diamonds Are Forever               12
Die Another Day                    14
Dr. No                             12
For Your Eyes Only                 11
From Russia with Love              12
GoldenEye                          14
Goldfinger                         12
Licence to Kill                    14
Live and Let Die                   11
Moonraker                          11
Never Say Never Again              12
No Time to Die                     12
Octopussy                          11
On Her Majesty's Secret Service    14
Quantum of Solace                  12
Skyfall                            12
Spectre                            12
The Living Daylights               14
The Man with the Golden Gun        11
The Spy Who Loved Me               11
The World Is Not Enough            14
Thunderball                        12
Tomorrow Never Dies                14
You Onl

In [105]:
# MOVIE RANKING SYSTEM
#
# CONDITION      -> DESIGNATION
# 80s movie      -> "Great 80's flick!"
# Pierce Brosnan -> "The best Bond ever!"
# Budget > 100   -> "Expensive movie, fun"
# Others         -> "No comment"

def rank_movie(row):
    year = row.loc["Year"]
    actor = row.loc["Actor"]
    budget = row.loc["Budget"]

    if year >= 1980 and year < 1990:
        return "Great 80's flick!"

    if actor == "Pierce Brosnan":
        return "The best Bond ever!"

    if budget > 100:
        return "Expensive movie, fun"

    return "No comment"

bond.apply(rank_movie, axis="columns")

Film
A View to a Kill                      Great 80's flick!
Casino Royale                      Expensive movie, fun
Casino Royale                                No comment
Diamonds Are Forever                         No comment
Die Another Day                     The best Bond ever!
Dr. No                                       No comment
For Your Eyes Only                    Great 80's flick!
From Russia with Love                        No comment
GoldenEye                           The best Bond ever!
Goldfinger                                   No comment
Licence to Kill                       Great 80's flick!
Live and Let Die                             No comment
Moonraker                                    No comment
Never Say Never Again                 Great 80's flick!
No Time to Die                     Expensive movie, fun
Octopussy                             Great 80's flick!
On Her Majesty's Secret Service              No comment
Quantum of Solace                  Expensiv