**Note**: These notes summarize material from www.codecademy.com; they are not my own original work.

1) Import Pandas and Numpy


In [1]:
import numpy as np
import pandas as pd

2) NumPy Arrays: NumPy arrays are unique in that they are more flexible than normal Python lists. They are called ndarrays since they can have any number (n) of dimensions (d). They hold a collection of items of any one data type and can be either a vector (one-dimensional) or a matrix (multi-dimensional). NumPy arrays allow for fast element access and efficient data manipulation.

The code below initializes a Python list named list1:


In [5]:
list1 = [1,2,3,4]


To convert this to a one-dimensional ndarray with four elements, we can use the np.array() function:

In [8]:
array1 = np.array(list1)
print(array1)


[1 2 3 4]


To get a two-dimensional ndarray from a list, we must start with a Python list of lists:

In [10]:
list2 = [[1,2,3],[4,5,6]]
array2 = np.array(list2)
print(array2)

[[1 2 3]
 [4 5 6]]
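As a quick check on the terms used above, the sketch below inspects the number of dimensions, the shape, and the dtype of arrays like the ones just created. The name `mixed` is introduced here only for illustration; `ndim`, `shape`, and `dtype` are standard ndarray attributes.

```python
import numpy as np

array1 = np.array([1, 2, 3, 4])             # one-dimensional
array2 = np.array([[1, 2, 3], [4, 5, 6]])   # two-dimensional

print(array1.ndim, array1.shape)   # 1 (4,)
print(array2.ndim, array2.shape)   # 2 (2, 3)

# All elements share one dtype; mixing ints and floats upcasts to float.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)                 # float64
```

Note that the one-dimensional array has shape (4,), a single axis of length four, rather than separate rows and columns.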


3) Pandas Series and DataFrames:

Just as the ndarray is the foundation of the NumPy library, the Series is the core object of the pandas library. A pandas Series is very similar to a one-dimensional NumPy array, but it has additional functionality that allows values in the Series to be indexed using labels. A NumPy array does not have the flexibility to do this. This labeling is useful when you are storing pieces of data that have other data associated with them. 
A Series holds items of any one data type and can be created by sending in a scalar value, Python list, dictionary, or ndarray as a parameter to the pandas Series constructor. If a dictionary is sent in, the keys may be used as the indices.


In [59]:
# Create a Series using a NumPy array of ages with the default numerical indices
ages = np.array([13,25,19])
series1 = pd.Series(ages)
print(series1)


0    13
1    25
2    19
dtype: int32


When printing a Series, the data type of its elements is also printed. To customize the indices of a Series object, use the index argument of the Series constructor.

In [61]:
# Create a Series using a NumPy array of ages but customize the indices to be the names that correspond to each age
ages = np.array([13,25,19])
series1 = pd.Series(ages,index=['Emma', 'Swetha', 'Serajh'])
print(series1)


Emma      13
Swetha    25
Serajh    19
dtype: int32
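The text above mentions that a Series can also be built from a dictionary, with the keys used as indices. A minimal sketch of that case, reusing the same names and ages (the variable names `ages_by_name` and `series2` are assumptions made for this example):

```python
import pandas as pd

# The dictionary keys become the index labels of the Series.
ages_by_name = {'Emma': 13, 'Swetha': 25, 'Serajh': 19}
series2 = pd.Series(ages_by_name)

print(series2['Swetha'])      # label-based lookup: 25
print(series2.iloc[0])        # position-based lookup: 13
print(list(series2.index))    # ['Emma', 'Swetha', 'Serajh']
```

This is the labeling that a plain NumPy array does not offer: values can be retrieved by name as well as by position.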


Another important type of object in the pandas library is the DataFrame. This object is similar in form to a matrix as it consists of rows and columns. Both rows and columns can be indexed with integers or string names. One DataFrame can contain many different data types, but within a column, everything has to be the same data type. A column of a DataFrame is essentially a Series. All columns must have the same number of elements (rows).

There are different ways to fill a DataFrame such as with a CSV file, a SQL query, a Python list, or a dictionary. Here we have created a DataFrame using a Python list of lists. Each nested list represents the data in one row of the DataFrame. We use the keyword columns to pass in the list of our custom column names.

In [63]:
dataf = pd.DataFrame([
    ['John Smith','123 Main St',34],
    ['Jane Doe', '456 Maple Ave',28],
    ['Joe Schmo', '789 Broadway',51]
    ],
    columns=['name','address','age'])

print(dataf)

         name        address  age
0  John Smith    123 Main St   34
1    Jane Doe  456 Maple Ave   28
2   Joe Schmo   789 Broadway   51


The default row indices are 0,1,2..., but these can be changed. For example, they can be set to be the elements in one of the columns of the DataFrame. To use the name column as indices instead of the default numerical values, we can run the following command on our DataFrame. Note that set_index() returns a new DataFrame and leaves dataf unchanged unless the result is reassigned:

In [31]:
dataf.set_index('name')


                  address  age
name
John Smith    123 Main St   34
Jane Doe    456 Maple Ave   28
Joe Schmo    789 Broadway   51
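A short sketch of how the result of set_index() might be used, assuming the same data as above. The variable name `indexed` is hypothetical; the point is that the call returns a new DataFrame whose rows can then be looked up by name.

```python
import pandas as pd

dataf = pd.DataFrame(
    [['John Smith', '123 Main St', 34],
     ['Jane Doe', '456 Maple Ave', 28],
     ['Joe Schmo', '789 Broadway', 51]],
    columns=['name', 'address', 'age'])

indexed = dataf.set_index('name')       # returns a new DataFrame

print(list(dataf.columns))              # original keeps all three columns
print(list(indexed.columns))            # ['address', 'age']
print(indexed.loc['Jane Doe', 'age'])   # 28
```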


# 10 minutes to pandas

Basic data structures in pandas

Pandas provides two types of classes for handling data:

1) **Series**: a one-dimensional labeled array holding data of any type, such as integers, strings, Python objects, etc.
2) **DataFrame**: a two-dimensional data structure that holds data like a two-dimensional array or a table with rows and columns.

#### Creating a **Series** by passing a list of values, letting pandas create a default **RangeIndex**


In [42]:
serie= pd.Series([1,2,3,np.nan,5,6])
print(serie)

0    1.0
1    2.0
2    3.0
3    NaN
4    5.0
5    6.0
dtype: float64


#### Creating a **DataFrame** by passing a NumPy array with a datetime index using **date_range()** and labeled columns:


In [3]:
datarange= pd.date_range('20240607',periods=7)
print(datarange)

DatetimeIndex(['2024-06-07', '2024-06-08', '2024-06-09', '2024-06-10',
               '2024-06-11', '2024-06-12', '2024-06-13'],
              dtype='datetime64[ns]', freq='D')


In [5]:
data_frame=pd.DataFrame(np.random.randn(7,4),index=datarange,columns=list("ABCD"))

In [54]:
print(data_frame)

                   A         B         C         D
2024-06-07 -2.238099  1.246563  0.588261 -0.013110
2024-06-08  1.236797 -0.150064 -0.966829 -0.524173
2024-06-09  0.618945 -0.839805 -0.302175  0.222257
2024-06-10 -0.952107  0.341111  1.367637  0.178250
2024-06-11  0.503426 -1.074760  0.360423 -0.516776
2024-06-12  0.877690 -1.062519  0.893875  1.579003
2024-06-13 -0.329142 -0.387801  0.004431 -0.520897


#### Creating a **DataFrame** by passing a dictionary of objects where the keys are the column labels and the values are the column values.

In [71]:
df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
print(df2)

     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo


## Viewing data

### Head and tail
To view a small sample of a Series or DataFrame object, use the head() and tail() methods. The default number of elements to display is five, but you may pass a custom number.

The method df.info() prints a concise summary of the DataFrame: the index range, the non-null count and dtype of each column, and the memory usage.

In [100]:
data_frame.info() 

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 7 entries, 2024-06-07 to 2024-06-13
Freq: D
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       7 non-null      float64
 1   B       7 non-null      float64
 2   C       7 non-null      float64
 3   D       7 non-null      float64
dtypes: float64(4)
memory usage: 280.0 bytes


In [73]:
data_frame.head() # shows the first 5 rows of the DataFrame

                   A         B         C         D
2024-06-07 -2.238099  1.246563  0.588261 -0.013110
2024-06-08  1.236797 -0.150064 -0.966829 -0.524173
2024-06-09  0.618945 -0.839805 -0.302175  0.222257
2024-06-10 -0.952107  0.341111  1.367637  0.178250
2024-06-11  0.503426 -1.074760  0.360423 -0.516776


In [75]:
data_frame.tail(3) # shows the last n rows, where n is the number passed to tail()

                   A         B         C         D
2024-06-11  0.503426 -1.074760  0.360423 -0.516776
2024-06-12  0.877690 -1.062519  0.893875  1.579003
2024-06-13 -0.329142 -0.387801  0.004431 -0.520897


### Select Columns

There are two possible syntaxes for selecting all values from a column:
1) Select the column as if you were selecting a value from a dictionary using a key. For a DataFrame named customers with an age column, we would type customers['age'] to select the ages.
2) If the name of a column follows all of the rules for a variable name (doesn’t start with a number, doesn’t contain spaces or special characters, etc.), then you can select it using the following notation: df.MySecondColumn. For the same customers DataFrame, we would type customers.age.
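The two notations can be compared side by side. The sketch below uses a hypothetical `customers` table (its values and the 'home city' column are made up for illustration) to show that both notations return the same Series, and that a column name containing a space only works with brackets.

```python
import pandas as pd

# Hypothetical customers table; 'home city' deliberately contains a space.
customers = pd.DataFrame({
    'age': [34, 28, 51],
    'home city': ['Lima', 'Quito', 'Bogota'],
})

# Both notations return the same Series when the name is a valid identifier.
print(customers['age'].equals(customers.age))   # True

# A name with a space can only be selected with bracket notation.
cities = customers['home city']
print(cities.tolist())   # ['Lima', 'Quito', 'Bogota']
```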

In [105]:
df = pd.DataFrame([
  ['January', 100, 100, 23, 100],
  ['February', 51, 45, 145, 45],
  ['March', 81, 96, 65, 96],
  ['April', 80, 80, 54, 180],
  ['May', 51, 54, 54, 154],
  ['June', 112, 109, 79, 129]],
  columns=['month', 'clinic_east',
           'clinic_north', 'clinic_south',
           'clinic_west']
)

clinic_north = df.clinic_north
print(clinic_north)

#to see what data type you’ve created. 
print(type(clinic_north))

0    100
1     45
2     96
3     80
4     54
5    109
Name: clinic_north, dtype: int64
<class 'pandas.core.series.Series'>


To select two or more columns from a DataFrame, we use a list of the column names. For example, to select the clinic_north and clinic_south columns, we would use:


In [10]:
df_2 = pd.DataFrame([
  ['January', 100, 100, 23, 100],
  ['February', 51, 45, 145, 45],
  ['March', 81, 96, 65, 96],
  ['April', 80, 80, 54, 180],
  ['May', 51, 54, 54, 154],
  ['June', 112, 109, 79, 129]],
  columns=['month', 'clinic_east',
           'clinic_north', 'clinic_south',
           'clinic_west']
)

clinic_north_south= df_2[['clinic_north','clinic_south']]

print(clinic_north_south)
print(type(clinic_north_south))

   clinic_north  clinic_south
0           100            23
1            45           145
2            96            65
3            80            54
4            54            54
5           109            79
<class 'pandas.core.frame.DataFrame'>


Display the **DataFrame.index** or **DataFrame.columns**:

In [86]:
data_frame.index # returns the index (row labels) of the DataFrame

DatetimeIndex(['2024-06-07', '2024-06-08', '2024-06-09', '2024-06-10',
               '2024-06-11', '2024-06-12', '2024-06-13'],
              dtype='datetime64[ns]', freq='D')

In [88]:
data_frame.columns # returns the column labels of the DataFrame

Index(['A', 'B', 'C', 'D'], dtype='object')

Return a NumPy representation of the underlying data with DataFrame.to_numpy() **without the index or column labels**:

In [93]:
data_frame.to_numpy()

array([[-2.2380994 ,  1.24656313,  0.58826056, -0.01310956],
       [ 1.23679673, -0.15006444, -0.96682929, -0.52417323],
       [ 0.61894532, -0.83980464, -0.30217464,  0.22225733],
       [-0.95210651,  0.3411105 ,  1.36763704,  0.1782496 ],
       [ 0.50342646, -1.07476043,  0.3604232 , -0.51677614],
       [ 0.87769006, -1.06251931,  0.89387502,  1.57900349],
       [-0.32914201, -0.38780052,  0.0044313 , -0.52089744]])

**NumPy arrays have one dtype for the entire array while pandas DataFrames have one dtype per column.**

In [100]:
data_frame.dtypes

A    float64
B    float64
C    float64
D    float64
dtype: object
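The dtype difference matters most when columns have different types. In the sketch below (both frames are hypothetical and exist only for this illustration), converting a frame with mixed column types forces NumPy to fall back to a single dtype that can hold every value.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one integer column and one string column.
mixed_df = pd.DataFrame({'age': [34, 28], 'name': ['John', 'Jane']})

print(mixed_df.dtypes)               # one dtype per column
print(mixed_df.to_numpy().dtype)     # object: the only dtype that fits both

# With only numeric columns, NumPy can pick a common numeric dtype.
num_df = pd.DataFrame({'a': [1, 2], 'b': [1.5, 2.5]})
print(num_df.to_numpy().dtype)       # float64
```

For data_frame above, all four columns are float64, so to_numpy() is cheap; with mixed columns the conversion can be comparatively expensive.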

Transposing your data with DataFrame.T swaps rows and columns:

In [102]:
data_frame.T

   2024-06-07  2024-06-08  2024-06-09  2024-06-10  2024-06-11  2024-06-12  2024-06-13
A   -2.238099    1.236797    0.618945   -0.952107    0.503426    0.877690   -0.329142
B    1.246563   -0.150064   -0.839805    0.341111   -1.074760   -1.062519   -0.387801
C    0.588261   -0.966829   -0.302175    1.367637    0.360423    0.893875    0.004431
D   -0.013110   -0.524173    0.222257    0.178250   -0.516776    1.579003   -0.520897


**DataFrame.sort_index()** sorts by an axis:

In [115]:
data_frame.sort_index(axis=1, ascending=False)

                   D         C         B         A
2024-06-07 -0.013110  0.588261  1.246563 -2.238099
2024-06-08 -0.524173 -0.966829 -0.150064  1.236797
2024-06-09  0.222257 -0.302175 -0.839805  0.618945
2024-06-10  0.178250  1.367637  0.341111 -0.952107
2024-06-11 -0.516776  0.360423 -1.074760  0.503426
2024-06-12  1.579003  0.893875 -1.062519  0.877690
2024-06-13 -0.520897  0.004431 -0.387801 -0.329142


DataFrame.sort_values() sorts by values:

In [12]:
data_frame.sort_values(by="B")

                   A         B         C         D
2024-06-13 -0.038482 -1.896800 -0.478478  0.858792
2024-06-12  1.033605 -1.093287 -1.040220 -0.837864
2024-06-08  0.163661 -0.556781 -0.065616 -0.330659
2024-06-09  0.071845 -0.270095  1.161171 -0.045670
2024-06-10  0.513841  0.576847 -0.487018  1.579464
2024-06-07  0.826671  1.211510 -0.465816 -1.018766
2024-06-11 -0.060594  1.731706 -1.940312  1.177172


### Selection
Getitem ([])
For a DataFrame, passing a single label selects a column and yields a Series equivalent to df.A:

In [14]:
data_frame["A"]

2024-06-07    0.826671
2024-06-08    0.163661
2024-06-09    0.071845
2024-06-10    0.513841
2024-06-11   -0.060594
2024-06-12    1.033605
2024-06-13   -0.038482
Freq: D, Name: A, dtype: float64

For a DataFrame, passing a slice : selects matching rows:

In [26]:
data_frame[0:4]

                   A         B         C         D
2024-06-07  0.826671  1.211510 -0.465816 -1.018766
2024-06-08  0.163661 -0.556781 -0.065616 -0.330659
2024-06-09  0.071845 -0.270095  1.161171 -0.045670
2024-06-10  0.513841  0.576847 -0.487018  1.579464


### Selection by label

See more in Selection by Label using DataFrame.loc[] or DataFrame.at[].

Selecting all rows (:) with selected column labels:


In [36]:
data_frame.loc[:,["A", "D"]]

                   A         D
2024-06-07  0.826671 -1.018766
2024-06-08  0.163661 -0.330659
2024-06-09  0.071845 -0.045670
2024-06-10  0.513841  1.579464
2024-06-11 -0.060594  1.177172
2024-06-12  1.033605 -0.837864
2024-06-13 -0.038482  0.858792


For label slicing, both endpoints are included:

In [40]:
data_frame.loc['20240607':'20240609',["A","B"]]

                   A         B
2024-06-07  0.826671  1.211510
2024-06-08  0.163661 -0.556781
2024-06-09  0.071845 -0.270095


### Selection by position

See more in Selection by Position using DataFrame.iloc[] or DataFrame.iat[].

Select via the position of the passed integers:

In [75]:
data_frame.iloc[2]

A    0.547998
B   -1.641096
C   -0.657083
D    0.321402
Name: 2024-06-09 00:00:00, dtype: float64

Integer slices act similarly to NumPy/Python:

In [79]:
data_frame.iloc[2:4,1:2]

                   B
2024-06-09 -1.641096
2024-06-10  2.872498
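Integer slices follow Python's half-open convention, whereas the label slices shown earlier include both endpoints. A minimal sketch contrasting the two, using a small illustrative frame (the name `frame` and its values are assumptions, not part of the notes):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20240607', periods=7)
frame = pd.DataFrame(np.arange(28).reshape(7, 4), index=dates, columns=list('ABCD'))

by_label = frame.loc['20240607':'20240609']   # end label is included
by_position = frame.iloc[0:2]                 # end position is excluded

print(len(by_label), len(by_position))        # 3 2
```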


Lists of integer position locations:

In [83]:
data_frame.iloc[[1,4],[1,3]]

                   B         D
2024-06-08 -1.959197 -0.915989
2024-06-11  0.489057  0.093398


For slicing rows explicitly:

In [85]:
data_frame.iloc[1:3,:]

                   A         B         C         D
2024-06-08  0.364710 -1.959197 -0.420259 -0.915989
2024-06-09  0.547998 -1.641096 -0.657083  0.321402


For slicing columns explicitly:

In [90]:
data_frame.iloc[:,2:3]

                   C
2024-06-07  0.750932
2024-06-08 -0.420259
2024-06-09 -0.657083
2024-06-10 -0.304363
2024-06-11  0.218573
2024-06-12  1.158234
2024-06-13 -0.241423


For getting a value explicitly:

In [98]:
print(data_frame)
data_frame.iloc[3,1]

                   A         B         C         D
2024-06-07 -0.312672 -0.235186  0.750932 -0.987485
2024-06-08  0.364710 -1.959197 -0.420259 -0.915989
2024-06-09  0.547998 -1.641096 -0.657083  0.321402
2024-06-10  0.448062  2.872498 -0.304363  0.150347
2024-06-11 -1.454357  0.489057  0.218573  0.093398
2024-06-12  0.651486 -1.473502  1.158234 -1.191296
2024-06-13  1.856535  0.005446 -0.241423  0.643110


2.8724982450251173

Negative positions count from the end: data_frame.iloc[-3:] selects the rows starting at the third-to-last row, up to and including the final row.

In [11]:
data_frame.iloc[-3:]

                   A         B         C         D
2024-06-11  0.503330  0.469158  0.243498 -0.530348
2024-06-12  1.184401  0.062812 -1.007449 -1.163581
2024-06-13 -1.069846  1.530293 -0.144316  0.653060


### Select Rows with Logic I

You can select a subset of a DataFrame by using logical statements:

df[df.MyColumnName == desired_column_value]

In [19]:
newvar=data_frame[data_frame.A> 0.6]
print(newvar)

                   A         B         C         D
2024-06-12  1.184401  0.062812 -1.007449 -1.163581


### Select Rows with Logic II
You can also combine multiple logical statements, as long as each statement is in parentheses.

In [26]:
newvar=data_frame[(data_frame.A > 0.6 ) | (data_frame.B < 0.1)]
print(newvar)



                   A         B         C         D
2024-06-07 -0.638566 -0.328276  0.839849 -1.399170
2024-06-08 -0.804508 -0.084281 -0.037944 -0.116029
2024-06-10 -0.858162 -0.763235  0.368868 -1.045368
2024-06-12  1.184401  0.062812 -1.007449 -1.163581


In [28]:
df = pd.DataFrame([
  ['January', 100, 100, 23, 100],
  ['February', 51, 45, 145, 45],
  ['March', 81, 96, 65, 96],
  ['April', 80, 80, 54, 180],
  ['May', 51, 54, 54, 154],
  ['June', 112, 109, 79, 129]],
  columns=['month', 'clinic_east',
           'clinic_north', 'clinic_south',
           'clinic_west'])

march_april = df[(df.month == 'March') | (df.month == 'April')]

print(march_april)

   month  clinic_east  clinic_north  clinic_south  clinic_west
2  March           81            96            65           96
3  April           80            80            54          180
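The examples above combine conditions with | (or). Conditions can be combined with & (and) in the same way. A small sketch, assuming a trimmed copy of the clinic data (the name `df_clinics` and the threshold of 80 are chosen only for illustration):

```python
import pandas as pd

df_clinics = pd.DataFrame([
    ['January', 100, 100], ['February', 51, 45], ['March', 81, 96],
    ['April', 80, 80], ['May', 51, 54], ['June', 112, 109]],
    columns=['month', 'clinic_east', 'clinic_north'])

# & means "and": both conditions must hold for a row to be kept.
# Each condition still needs its own parentheses.
busy_both = df_clinics[(df_clinics.clinic_east > 80) & (df_clinics.clinic_north > 80)]
print(busy_both.month.tolist())   # ['January', 'March', 'June']
```

The parentheses are required because & and | bind more tightly than comparison operators such as > and ==.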


### Select Rows with Logic III

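Suppose we want to select the rows where a column matches any one of several values, for example three months. Instead of chaining several conditions with |, we can use the .isin() method: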

In [35]:
df = pd.DataFrame([
  ['January', 100, 100, 23, 100],
  ['February', 51, 45, 145, 45],
  ['March', 81, 96, 65, 96],
  ['April', 80, 80, 54, 180],
  ['May', 51, 54, 54, 154],
  ['June', 112, 109, 79, 129]],
  columns=['month', 'clinic_east',
           'clinic_north', 'clinic_south',
           'clinic_west'])

january_february_march= df[df.month.isin(['January','February','March'])]

print(january_february_march)

      month  clinic_east  clinic_north  clinic_south  clinic_west
0   January          100           100            23          100
1  February           51            45           145           45
2     March           81            96            65           96


### Setting indices

When we select a subset of a DataFrame using logic, we end up with non-consecutive indices. This is inelegant and makes it hard to use .iloc[].

We can fix this using the method .reset_index(). If we use the command df.reset_index(), we get a new DataFrame with a new set of indices; the old indices have been moved into a new column called 'index'. Unless you need those values for something special, it’s probably better to use the keyword drop=True so that you don’t end up with that extra column. If we run the command df.reset_index(drop=True), we get a new DataFrame without the old indices.
Notice that using .reset_index() will return a new DataFrame, but we usually just want to modify our existing DataFrame. If we use the keyword inplace=True, the existing DataFrame is modified directly and the method returns None.


In [39]:
df = pd.DataFrame([
  ['January', 100, 100, 23, 100],
  ['February', 51, 45, 145, 45],
  ['March', 81, 96, 65, 96],
  ['April', 80, 80, 54, 180],
  ['May', 51, 54, 54, 154],
  ['June', 112, 109, 79, 129]],
  columns=['month', 'clinic_east',
           'clinic_north', 'clinic_south',
           'clinic_west']
)

df2 = df.loc[[1, 3, 5]]

print(df2)
print("Note that the indices on df2 are not consecutive. Create a new DataFrame called df3 by resetting the indices on df2 (don’t use inplace or drop). Did df2 change after you ran this command?")

df3=df2.reset_index()
print(df3)
print("Reset the indices of df2 by using the keyword inplace=True and drop=True. Did the indices of df2 change? How is df2 different from df3?")
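# With inplace=True, reset_index() modifies df2 directly and returns None,
# so df3 is None after this assignment; inspect df2 itself to see the new indices.
df3=df2.reset_index(inplace = True, drop = True)

print(df3)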

      month  clinic_east  clinic_north  clinic_south  clinic_west
1  February           51            45           145           45
3     April           80            80            54          180
5      June          112           109            79          129
Note that the indices on df2 are not consecutive. Create a new DataFrame called df3 by resetting the indices on df2 (don’t use inplace or drop). Did df2 change after you ran this command?
   index     month  clinic_east  clinic_north  clinic_south  clinic_west
0      1  February           51            45           145           45
1      3     April           80            80            54          180
2      5      June          112           109            79          129
Reset the indices of df2 by using the keyword inplace=True and drop=True. Did the indices of df2 change? How is df2 different from df3?
None
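The None printed above is what reset_index() returns when inplace=True is used. A minimal sketch of the two usual patterns, assuming a small stand-in frame (`df_small` and `result` are hypothetical names):

```python
import pandas as pd

df_small = pd.DataFrame({'month': ['February', 'April', 'June']}, index=[1, 3, 5])

# Pattern 1: reassign the returned DataFrame; drop=True discards the old index.
df_small = df_small.reset_index(drop=True)
print(list(df_small.index))     # [0, 1, 2]

# Pattern 2: inplace=True changes the frame itself and returns None,
# so the return value should not be assigned to anything useful.
result = df_small.reset_index(drop=True, inplace=True)
print(result)                   # None
```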


In [48]:
print(df.iloc[2])

month           March
clinic_east        81
clinic_north       96
clinic_south       65
clinic_west        96
Name: 2, dtype: object


# Modifying DataFrames
## Adding a Column I

Sometimes, we want to add a column to an existing DataFrame. We might want to add new information or perform a calculation based on the data that we already have.
    
For example, the following code adds a Quantity column by assigning a list with one value per row:

**df['Quantity'] = [100, 150, 50, 35]**

In [7]:
df = pd.DataFrame([
  [1, '3 inch screw', 0.5, 0.75],
  [2, '2 inch nail', 0.10, 0.25],
  [3, 'hammer', 3.00, 5.50],
  [4, 'screwdriver', 2.50, 3.00]
],
  columns=['Product ID', 'Description', 'Cost to Manufacture', 'Price']
)

# Add columns here
df['Sold in Bulk?']=['Yes','Yes','No','No']
print(df)

   Product ID   Description  Cost to Manufacture  Price Sold in Bulk?
0           1  3 inch screw                  0.5   0.75           Yes
1           2   2 inch nail                  0.1   0.25           Yes
2           3        hammer                  3.0   5.50            No
3           4   screwdriver                  2.5   3.00            No


## Adding a Column II
We can also add a new column that is the same for all rows in the DataFrame.
Suppose we know that all of our products are currently in stock. We can add a column that says this:

**df['In Stock?'] = True**

In [11]:
# Add columns here
df['Is taxed?']='Yes'
print(df)

   Product ID   Description  Cost to Manufacture  Price Sold in Bulk?  \
0           1  3 inch screw                  0.5   0.75           Yes   
1           2   2 inch nail                  0.1   0.25           Yes   
2           3        hammer                  3.0   5.50            No   
3           4   screwdriver                  2.5   3.00            No   

  Is taxed?  
0       Yes  
1       Yes  
2       Yes  
3       Yes  


## Adding a Column III
Finally, you can add a new column by performing a function on the existing columns.

Maybe we want to add a column to our inventory table with the amount of sales tax that we need to charge for each item. The following code multiplies each Price by 0.075, the sales tax for our state:

**df['Sales Tax'] = df.Price * 0.075**

In [13]:
# Add column here
df['Margin']=df.Price-df['Cost to Manufacture']
print(df)

   Product ID   Description  Cost to Manufacture  Price Sold in Bulk?  \
0           1  3 inch screw                  0.5   0.75           Yes   
1           2   2 inch nail                  0.1   0.25           Yes   
2           3        hammer                  3.0   5.50            No   
3           4   screwdriver                  2.5   3.00            No   

  Is taxed?  Margin  
0       Yes    0.25  
1       Yes    0.15  
2       Yes    2.50  
3       Yes    0.50  


## Performing Column Operations

Often, the column that we want to add is related to existing columns, but requires a calculation more complex than multiplication or addition.

We can use the apply function to apply a function to every value in a particular column. For example, this code overwrites the existing 'Name' column by applying the function str.upper to every row in 'Name'.

**df['Name'] = df.Name.apply(str.upper)**

In [22]:
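# The output below also shows a 'Lowercase Description' column, presumably created the same way with str.lower
df['Lowercase Description']=df.Description.apply(str.lower)
df['Uppercase Description']=df.Description.apply(str.upper)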
print(df)

   Product ID   Description  Cost to Manufacture  Price Sold in Bulk?  \
0           1  3 inch screw                  0.5   0.75           Yes   
1           2   2 inch nail                  0.1   0.25           Yes   
2           3        hammer                  3.0   5.50            No   
3           4   screwdriver                  2.5   3.00            No   

  Is taxed?  Margin Lowercase Description Uppercase Description  
0       Yes    0.25          3 inch screw          3 INCH SCREW  
1       Yes    0.15           2 inch nail           2 INCH NAIL  
2       Yes    2.50                hammer                HAMMER  
3       Yes    0.50           screwdriver           SCREWDRIVER  


## Reviewing Lambda Function

A lambda function is a way of defining a function in a single line of code. Usually, we would assign them to a variable.

For example, the following lambda function multiplies a number by 2 and then adds 3:

**mylambda = lambda x: (x * 2) + 3**

print(mylambda(5))


In [None]:
mylambda= lambda x : x[0]+x[-1]

print(mylambda('Alejandro'))

### Reviewing Lambda Function: If Statements
In general, the syntax for an if statement (a conditional expression) in a lambda function is:

lambda x: [OUTCOME IF TRUE] if [CONDITIONAL] else [OUTCOME IF FALSE]

In [25]:

mylambda= lambda x: 'Welcome to BattleCity!' if x >= 13 else 'You must be 13 or older'

print(mylambda(4))

You must be 13 or older


### Applying a Lambda to a Column
In Pandas, we often use lambda functions to perform complex operations on columns.
For example, to extract the email provider from an Email column, we could use the following code with a lambda function and the string method .split():

df['Email Provider'] = df.Email.apply(
    lambda x: x.split('@')[-1]
    )
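A runnable version of the same idea, assuming a small DataFrame with made-up addresses (the name `df_email` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical addresses, used only for illustration.
df_email = pd.DataFrame({'Email': ['john.smith@gmail.com', 'jane.doe@yahoo.com']})

# split('@') separates the text before and after the '@'; [-1] takes the last part.
df_email['Email Provider'] = df_email.Email.apply(lambda x: x.split('@')[-1])
print(df_email['Email Provider'].tolist())   # ['gmail.com', 'yahoo.com']
```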


#### For instance

df = pd.read_csv('employees.csv')
get_last_name= lambda x: x.split(' ')[-1]

print(df)

**df['last_name']=df.name.apply(get_last_name)**

## Applying a Lambda to a Row

We can also operate on multiple columns at once. If we use apply without specifying a single column and add the argument axis=1, the input to our lambda function will be an entire row, not a column. To access particular values of the row, we use the syntax row.column_name or row['column_name'].

df['Price with Tax'] = df.apply(lambda row:
     row['Price'] * 1.075
     if row['Is taxed?'] == 'Yes'
     else row['Price'],
     axis=1
)
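To see the row-wise version in action, the sketch below applies the same lambda to a two-row frame. The frame `df_tax` is hypothetical, and the second item is marked as untaxed only so that both branches of the condition are exercised.

```python
import pandas as pd

df_tax = pd.DataFrame({
    'Description': ['hammer', 'screwdriver'],
    'Price': [5.50, 3.00],
    'Is taxed?': ['Yes', 'No'],   # hypothetical: the second item is untaxed
})

# axis=1 passes each row to the lambda, so several columns can be read at once.
df_tax['Price with Tax'] = df_tax.apply(
    lambda row: row['Price'] * 1.075 if row['Is taxed?'] == 'Yes' else row['Price'],
    axis=1)

print(df_tax['Price with Tax'].round(4).tolist())   # [5.9125, 3.0]
```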


## Example 

df = pd.read_csv('employees.csv')

total_earned = lambda row: (
    (row.hourly_wage * 40) + ((row.hourly_wage * 1.5) * (row.hours_worked - 40))
    if row.hours_worked > 40
    else row.hourly_wage * row.hours_worked
)
  
df['total_earned'] = df.apply(total_earned, axis = 1)

print(df)
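Since employees.csv is not included with these notes, the overtime rule can be tried on a small stand-in table. The names and numbers below are invented; only the column names hourly_wage and hours_worked come from the example above. The logic is written as a regular function here, which is equivalent to the lambda and a little easier to read.

```python
import pandas as pd

# Hypothetical stand-in for employees.csv with the columns the rule expects.
employees = pd.DataFrame({
    'name': ['Ana', 'Luis'],
    'hourly_wage': [20, 10],
    'hours_worked': [45, 30],
})

def total_earned(row):
    # Hours beyond 40 are paid at 1.5 times the hourly wage.
    if row.hours_worked > 40:
        overtime_hours = row.hours_worked - 40
        return row.hourly_wage * 40 + row.hourly_wage * 1.5 * overtime_hours
    return row.hourly_wage * row.hours_worked

employees['total_earned'] = employees.apply(total_earned, axis=1)
print(employees['total_earned'].tolist())   # [950.0, 300.0]
```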

## Renaming Columns

When we get our data from other sources, we often want to change the column names. For example, we might want all of the column names to follow variable name rules, so that we can use df.column_name (which tab-completes) rather than df['column_name'] (which takes up extra space).

You can change all of the column names at once by setting the .columns property to a different list. This is great when you need to change all of the column names at once, but be careful! You can easily mislabel columns if you get the ordering wrong. Here’s an example:

df = pd.DataFrame({
    'name': ['John', 'Jane', 'Sue', 'Fred'],
    'age': [23, 29, 21, 18]
})
df.columns = ['First Name', 'Age']

## Renaming Columns II

You can also rename individual columns by using the .rename method. Pass a dictionary like the one below to the columns keyword argument:

{'old_column_name1': 'new_column_name1', 'old_column_name2': 'new_column_name2'}

df = pd.DataFrame({
    'name': ['John', 'Jane', 'Sue', 'Fred'],
    'age': [23, 29, 21, 18]
})
df.rename(columns={
    'name': 'First Name',
    'age': 'Age'},
    inplace=True)
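Unlike assigning to .columns, .rename does not require listing every column. The sketch below (with a hypothetical `people` frame) renames just one column and omits inplace=True, in which case the original DataFrame is left untouched and a renamed copy is returned.

```python
import pandas as pd

people = pd.DataFrame({'name': ['John', 'Jane'], 'age': [23, 29]})

# Without inplace=True, rename returns a new DataFrame and leaves the original alone.
renamed = people.rename(columns={'name': 'First Name'})

print(list(people.columns))    # ['name', 'age']
print(list(renamed.columns))   # ['First Name', 'age']
```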

### Some exercises 

import codecademylib3  # helper module provided by the Codecademy learning environment; omit it when running elsewhere
import pandas as pd

orders = pd.read_csv('shoefly.csv')
#Examine the first 5 rows of the data using print and .head().
print(orders.head())

#### Add a new column called shoe_source, which is vegan if the material is not leather and animal otherwise.

orders['shoe_source'] = orders.shoe_material.apply(lambda x: 'animal' if x == 'leather' else 'vegan')

#### Using the columns last_name and gender create a column called salutation which contains Dear Mr. <last_name> for men and Dear Ms. <last_name> for women.

orders['salutation'] = orders.apply(lambda row: \
                                    'Dear Mr. ' + row['last_name']
                                    if row['gender'] == 'male'
                                    else 'Dear Ms. ' + row['last_name'],
                                    axis=1)

To be continued in "Boolean indexing" (https://pandas.pydata.org/docs/user_guide/10min.html#viewing-data)

In [2]:
print(len("The Great Barrier Reef is visible from outer space."))

51
