<h1>Introduction to Pandas</h1>

<b>Import pandas library</b>

In [5]:
import pandas as pd

<h2>Series</h2> <p>A Series is a one-dimensional, array-like object that can hold data of any type (integer, string, float, etc.) together with an index of labels, one per value.</p>

In [6]:
obj = pd.Series([2, 4, 6, 8, 10])

In [7]:
obj

0     2
1     4
2     6
3     8
4    10
dtype: int64

In [8]:
obj.values

array([ 2,  4,  6,  8, 10])

<b>obj.values</b> returns the data as a one-dimensional NumPy array.

In [9]:
obj.index

RangeIndex(start=0, stop=5, step=1)

<b>obj.index</b> returns an Index object that contains the labels of the Series. Because no index was given, pandas created a default RangeIndex running from 0 to 4.
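
The index does not have to be the default range. The following is a minimal sketch, assuming an illustrative Series named <b>scores</b> (not part of this notebook), that passes explicit labels and then looks a value up by label.

```python
import pandas as pd

# Illustrative Series with explicit string labels instead of the default 0..n-1 index.
scores = pd.Series([2, 4, 6], index=['a', 'b', 'c'])

# Look up a single value by its label.
value_b = scores['b']

# Labels and values remain available separately.
labels = list(scores.index)
values = scores.to_numpy()

print(value_b)
print(labels)
print(values)
```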

In [10]:
data = {'Ohio': 3500, 'Texas': 7100, 'Oregon': 1600, 'Utah': 5000}

In [11]:
obj2 = pd.Series(data)

In [12]:
obj2

Ohio      3500
Texas     7100
Oregon    1600
Utah      5000
dtype: int64

<p>Passing a dictionary to <b>pd.Series</b> creates a Series from it.</p><p>The keys (here, the states) become the <b>index</b> labels, and the dictionary values become the corresponding data.</p>

In [13]:
states = ['California', 'Ohio', 'Oregon', 'Texas']

In [14]:
obj3 = pd.Series(data, index=states)

In [15]:
obj3

California       NaN
Ohio          3500.0
Oregon        1600.0
Texas         7100.0
dtype: float64

<p>To create a pandas Series with a specific index order, pass the labels explicitly through the <b>index</b> argument.</p>
<p>Only the labels listed in the index are kept, so <b>'Utah'</b> is dropped. The label <b>'California'</b> is not present in the data dictionary, so its value is NaN.</p>
<p>pandas fills every index label that has no matching key in the dictionary with NaN (Not a Number) to mark missing data. Because NaN is a floating-point value, the dtype changes from int64 to float64.</p>

In [16]:
pd.isnull(obj3)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

<p>This function is used to detect missing values in a pandas object. When applied to a Series like obj3, it returns a Series of the same shape with boolean values indicating whether each corresponding value is NaN (True) or not (False).</p>

In [17]:
pd.notnull(obj3)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

<p>This function is used to detect non-missing values in a pandas object. When applied to a Series like obj3, it returns a Series of the same shape with boolean values indicating whether each corresponding value is not NaN (True) or NaN (False).</p>
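
The same checks are also available as methods on the Series itself. The sketch below assumes an illustrative Series named <b>population</b> (not part of this notebook) and shows one way the boolean result might be used: counting missing entries, keeping only the present ones, and filling the gaps.

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value.
population = pd.Series({'California': np.nan, 'Ohio': 3500.0, 'Texas': 7100.0})

# True counts as 1, so summing the mask counts the missing entries.
missing_count = int(population.isna().sum())

# Use the boolean mask to keep only the non-missing entries.
present = population[population.notna()]

# Replace missing entries with a chosen default value.
filled = population.fillna(0)

print(missing_count)
print(present)
print(filled)
```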

In [18]:
data2 = {'Ohio': 2100, 'Texas': 1800, 'Oregon': 500, 'Utah': 1200}

In [19]:
states2 = ['California', 'Ohio', 'Oregon', 'Texas', 'Utah']

In [20]:
obj5 = pd.Series(data2,index=states2)

In [21]:
obj2

Ohio      3500
Texas     7100
Oregon    1600
Utah      5000
dtype: int64

In [22]:
obj5

California       NaN
Ohio          2100.0
Oregon         500.0
Texas         1800.0
Utah          1200.0
dtype: float64

In [23]:
obj2 + obj5

California       NaN
Ohio          5600.0
Oregon        2100.0
Texas         8900.0
Utah          6200.0
dtype: float64

<p>You can perform arithmetic operations, such as addition, between two Series objects.</p> 
<p>When you add two Series, pandas aligns them by their index labels and performs the operation element-wise. <br>If a label is not present in both Series, the resulting value will be NaN for that label.</p>
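
If NaN for unmatched labels is not what you want, the <b>add</b> method accepts a <b>fill_value</b>. A minimal sketch, assuming two illustrative Series <b>a</b> and <b>b</b> that are not part of this notebook:

```python
import pandas as pd

# Two illustrative Series that share only the 'Ohio' label.
a = pd.Series({'Ohio': 1.0, 'Texas': 2.0})
b = pd.Series({'Ohio': 10.0, 'Utah': 30.0})

# The + operator yields NaN wherever a label exists in only one Series.
plain = a + b

# add() with fill_value treats a label missing from one side as 0 instead.
filled = a.add(b, fill_value=0)

print(plain)
print(filled)
```

Note that <b>fill_value</b> only helps when the label is missing on one side; if the value is missing on both sides, the result is still NaN.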

<hr>
<h2>DataFrame</h2>
<p>A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).</p>

In [24]:
data = {
        'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'population' : [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
         'year' : ['2000', '2001', '2002', '2001', '2002', '2003']
       }

In [25]:
frame = pd.DataFrame(data)

In [26]:
frame

Unnamed: 0,state,population,year
0,Ohio,1.5,2000
1,Ohio,1.7,2001
2,Ohio,3.6,2002
3,Nevada,2.4,2001
4,Nevada,2.9,2002
5,Nevada,3.2,2003


<p><b>pd.DataFrame(data)</b> creates a DataFrame from the dictionary <b>data</b>: each key becomes a column name and each list becomes that column's values.</p>
<p>The input can be any accepted data structure, such as a dictionary, a list, or another DataFrame.</p>

In [27]:
frame.head()

Unnamed: 0,state,population,year
0,Ohio,1.5,2000
1,Ohio,1.7,2001
2,Ohio,3.6,2002
3,Nevada,2.4,2001
4,Nevada,2.9,2002


<b>head()</b> returns the first rows of a DataFrame. <br> By default it returns the first five rows; pass an integer, as in <b>frame.head(3)</b>, to choose a different number.
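
A short sketch of passing a row count, assuming an illustrative DataFrame named <b>df</b> (not part of this notebook). It also shows <b>tail()</b>, the counterpart that returns rows from the end.

```python
import pandas as pd

# Illustrative six-row DataFrame.
df = pd.DataFrame({
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
    'population': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
})

# First two rows instead of the default five.
first_two = df.head(2)

# Last two rows.
last_two = df.tail(2)

print(first_two)
print(last_two)
```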

In [28]:
frame = pd.DataFrame(data, columns=['state', 'year', 'population'])

When creating a DataFrame in pandas, you can specify the columns explicitly to ensure they are in the desired order.

In [29]:
frame

Unnamed: 0,state,year,population
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


<b>Custom Row Index</b>

In [30]:
frame2 = pd.DataFrame(data, columns=['state', 'year', 'population'], index=['A', 'B', 'C', 'D', 'E', 'F'])

In [31]:
frame2

Unnamed: 0,state,year,population
A,Ohio,2000,1.5
B,Ohio,2001,1.7
C,Ohio,2002,3.6
D,Nevada,2001,2.4
E,Nevada,2002,2.9
F,Nevada,2003,3.2


You can also specify custom row indices when creating a DataFrame in pandas.

<b>Accessing Columns</b>

In [32]:
frame2['state']

A      Ohio
B      Ohio
C      Ohio
D    Nevada
E    Nevada
F    Nevada
Name: state, dtype: object

To access the <b>'state'</b> column from the frame2 DataFrame, you can use the bracket notation. This will return a pandas Series containing the data from the <b>'state'</b> column.

In [33]:
frame2.year

A    2000
B    2001
C    2002
D    2001
E    2002
F    2003
Name: year, dtype: object

You can access a column using dot notation only if its name is a valid Python identifier (no spaces or special characters) and does not clash with an existing DataFrame attribute or method name.

In [34]:
frame2[['state', 'year']]

Unnamed: 0,state,year
A,Ohio,2000
B,Ohio,2001
C,Ohio,2002
D,Nevada,2001
E,Nevada,2002
F,Nevada,2003


To access multiple columns from a DataFrame in pandas, you can use double brackets [[]]. <br>This will return a new DataFrame containing only the specified columns.

<b>Differences of the methods in accessing columns</b>

<p>Dot Notation (df.column_name): Limited to valid Python identifiers and accesses a single column, returning a Series.</p>
<p>Single Bracket Notation (df['column_name']): Works with any column name and accesses a single column, returning a Series.</p>
<p>Double Bracket Notation (df[['column1', 'column2']]): Used to access multiple columns and returns a new DataFrame.</p>
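
The differences can be checked directly. The sketch below assumes an illustrative DataFrame <b>df</b> (not part of this notebook) with one column name containing a space and one, <b>'count'</b>, that shares its name with a DataFrame method.

```python
import pandas as pd

# Illustrative DataFrame; 'birth rate' has a space, 'count' clashes with DataFrame.count().
df = pd.DataFrame({
    'state': ['Ohio', 'Nevada'],
    'birth rate': [1.2, 1.4],
    'count': [10, 20],
})

by_dot = df.state            # Series: 'state' is a valid identifier
by_bracket = df['birth rate']  # Series: brackets work with any column name
as_frame = df[['state']]     # DataFrame: a list of names, even a list of one

# Dot notation finds the method, not the column, when the names clash.
dot_count_is_method = callable(df.count)
count_column = df['count']

print(type(by_dot).__name__, type(by_bracket).__name__, type(as_frame).__name__)
print(dot_count_is_method)
```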

<b>Selection with loc and iloc</b>

In [35]:
frame2.loc['C']

state         Ohio
year          2002
population     3.6
Name: C, dtype: object


The <b>loc</b> accessor in pandas is used to access a group of rows and columns by labels or a boolean array. <br>When you use <b>df.loc['C']</b>, you are retrieving the row with the index label <b>'C'</b>.

In [36]:
frame2.iloc[2]

state         Ohio
year          2002
population     3.6
Name: C, dtype: object

The <b>iloc</b> accessor in pandas is used to access rows and columns by their integer positions (indices). <br>When you use <b>df.iloc[2]</b>, you are retrieving the row at the 3rd position (index 2, since indexing is zero-based).

In [37]:
frame2.loc['B','state']

'Ohio'

It retrieves the value in the <b>'state'</b> column for the row with the index label <b>'B'</b>.

In [38]:
frame2.loc[['A', 'B'],['state', 'year']]

Unnamed: 0,state,year
A,Ohio,2000
B,Ohio,2001


It selects rows with the labels <b>'A'</b> and <b>'B'</b> and columns <b>'state'</b> and <b>'year'</b>.

<b>Differences</b>
<p>Single row and single column: accesses a single cell in the DataFrame and returns a scalar value.</p>
<p>Multiple rows and multiple columns: accesses several rows and columns and returns a new DataFrame containing them.</p>
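
One detail worth seeing in code is how the two accessors treat slices. In this minimal sketch, which assumes an illustrative DataFrame <b>df</b> (not part of this notebook), a label slice with <b>loc</b> includes its end label, while a position slice with <b>iloc</b> excludes its end position, as in ordinary Python slicing.

```python
import pandas as pd

# Illustrative DataFrame with string row labels.
df = pd.DataFrame(
    {'state': ['Ohio', 'Ohio', 'Nevada', 'Nevada'],
     'year': ['2000', '2001', '2001', '2002']},
    index=['A', 'B', 'C', 'D'],
)

# Label slice: both 'A' and 'C' are included.
by_label = df.loc['A':'C']

# Position slice: position 2 is excluded, so only rows 0 and 1 are returned.
by_position = df.iloc[0:2]

# Single cell by position: second row, first column.
cell = df.iloc[1, 0]

print(by_label)
print(by_position)
print(cell)
```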

<b>Dropping Entries from an Axis</b>

In [39]:
frame3 = pd.DataFrame(data, columns=['state', 'year', 'population', 'dept'], index=['A', 'B', 'C', 'D', 'E', 'F'])

In [40]:
frame3

Unnamed: 0,state,year,population,dept
A,Ohio,2000,1.5,
B,Ohio,2001,1.7,
C,Ohio,2002,3.6,
D,Nevada,2001,2.4,
E,Nevada,2002,2.9,
F,Nevada,2003,3.2,


In [41]:
frame3.drop('dept', axis=1)

Unnamed: 0,state,year,population
A,Ohio,2000,1.5
B,Ohio,2001,1.7
C,Ohio,2002,3.6
D,Nevada,2001,2.4
E,Nevada,2002,2.9
F,Nevada,2003,3.2


<p>This command returns a copy of <b>frame3</b> without the <b>'dept'</b> column; <b>frame3</b> itself is left unchanged.</p>
<p>To drop a column from a DataFrame in pandas, you can use the drop method with the <b>axis=1</b> argument, which indicates that you want to drop a column (as opposed to a row).</p>
<p><b>'dept'</b> : The name of the column to be dropped. <br>
<b>axis=1</b> : Specifies that a column is being dropped (instead of a row).</p>

In [42]:
frame3

Unnamed: 0,state,year,population,dept
A,Ohio,2000,1.5,
B,Ohio,2001,1.7,
C,Ohio,2002,3.6,
D,Nevada,2001,2.4,
E,Nevada,2002,2.9,
F,Nevada,2003,3.2,


In [43]:
frame3.drop('dept', axis=1, inplace=True)

Passing <b>inplace=True</b> to the drop method modifies the <b>original DataFrame directly</b>, so <b>frame3</b> is updated without being reassigned. In that case the call itself returns None.
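
The two styles can be compared side by side. This is a sketch under the assumption of an illustrative DataFrame <b>df</b> (not part of this notebook); it uses the <b>columns=</b> and <b>index=</b> keywords of <b>drop</b>, which are an alternative to passing <b>axis</b>.

```python
import numpy as np
import pandas as pd

# Illustrative DataFrame with an empty 'dept' column.
df = pd.DataFrame(
    {'state': ['Ohio', 'Nevada'], 'dept': [np.nan, np.nan]},
    index=['A', 'B'],
)

# Without inplace, drop returns a new DataFrame and leaves df untouched.
without_dept = df.drop(columns='dept')

# The same method removes rows when given index labels.
without_row_a = df.drop(index='A')

# With inplace=True the object is modified and the call returns None.
working = df.copy()
returned = working.drop(columns='dept', inplace=True)

print(list(df.columns))
print(list(without_dept.columns))
print(list(working.columns), returned)
```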

In [44]:
frame3

Unnamed: 0,state,year,population
A,Ohio,2000,1.5
B,Ohio,2001,1.7
C,Ohio,2002,3.6
D,Nevada,2001,2.4
E,Nevada,2002,2.9
F,Nevada,2003,3.2


<b>Filtering</b>

In [45]:
frame3['population'] > 1.6

A    False
B     True
C     True
D     True
E     True
F     True
Name: population, dtype: bool

It creates a boolean Series where each element is <b>True</b> if the corresponding value in the <b>'population'</b> column is greater than <b>1.6</b>, and <b>False</b> otherwise.

In [46]:
import numpy as np

In [47]:
frame3['debt'] = np.arange(6.)

It creates an array of six values <b>(0.0 to 5.0)</b> and assigns it to the <b>'debt'</b> column of <b>frame3</b>.

In [48]:
frame3

Unnamed: 0,state,year,population,debt
A,Ohio,2000,1.5,0.0
B,Ohio,2001,1.7,1.0
C,Ohio,2002,3.6,2.0
D,Nevada,2001,2.4,3.0
E,Nevada,2002,2.9,4.0
F,Nevada,2003,3.2,5.0


In [49]:
frame3[frame3['debt'] > 2]

Unnamed: 0,state,year,population,debt
D,Nevada,2001,2.4,3.0
E,Nevada,2002,2.9,4.0
F,Nevada,2003,3.2,5.0


<p>To filter a DataFrame based on a condition applied to one of its columns, you can use <b>boolean indexing</b>. </p>
<p><b>Boolean Indexing: frame3['debt'] > 2</b> creates a boolean Series where each element is <b>True</b> if the corresponding value in the <b>'debt'</b> column is greater than <b>2</b>, and False otherwise.</p>
<p><b>Filtering the DataFrame: frame3[frame3['debt'] > 2]</b> uses the boolean Series to filter the rows in <b>frame3</b> where the condition is <b>True</b>.</p>

In [50]:
frame3[(frame3['debt'] > 2) | (frame3['population'] < 3)]

Unnamed: 0,state,year,population,debt
A,Ohio,2000,1.5,0.0
B,Ohio,2001,1.7,1.0
D,Nevada,2001,2.4,3.0
E,Nevada,2002,2.9,4.0
F,Nevada,2003,3.2,5.0


<p>To filter a DataFrame based on multiple conditions, you can use logical operators like | (OR) and & (AND) along with parentheses to group conditions.</p>
<p><b>(frame3['debt'] > 2)</b> creates a boolean Series where each element is <b>True</b> if the corresponding <b>'debt'</b> value is greater than <b>2</b>.</p>
<p><b>(frame3['population'] < 3)</b> creates a boolean Series where each element is <b>True</b> if the corresponding <b>'population'</b> value is less than <b>3</b>.</p>
<p>The <b>| operator</b> combines these two conditions, creating a boolean Series where each element is <b>True</b> if <b>either condition is True</b>.</p>

In [51]:
frame3[(frame3['debt'] > 2) & (frame3['population'] < 3)]

Unnamed: 0,state,year,population,debt
D,Nevada,2001,2.4,3.0
E,Nevada,2002,2.9,4.0


<p>To filter a DataFrame based on multiple conditions using the logical AND operator, you can use the & operator along with parentheses to group the conditions.</p>
<p><b>(frame3['debt'] > 2)</b> creates a boolean Series where each element is <b>True</b> if the corresponding <b>'debt'</b> value is greater than <b>2</b>.</p>
<p><b>(frame3['population'] < 3)</b> creates a boolean Series where each element is <b>True</b> if the corresponding <b>'population'</b> value is less than <b>3</b>.</p>
<p>The <b>& operator</b> combines these two conditions, creating a boolean Series where each element is <b>True</b> only <b>if both conditions are True</b>.</p>


In [52]:
frame3[['state','year']][frame3['population'] > 2]

Unnamed: 0,state,year
C,Ohio,2002
D,Nevada,2001
E,Nevada,2002
F,Nevada,2003


<p>To filter rows based on a condition and then select specific columns from the DataFrame, you can combine boolean indexing with column selection. </p>
<p><b>Boolean Indexing: frame3['population'] > 2</b> creates a boolean Series where each element is True if the corresponding 'population' value is greater than 2.</p>
<p><b>frame3[['state', 'year']]</b> selects the <b>'state'</b> and <b>'year'</b> columns from the DataFrame.</p>
<p><b>[frame3['population'] > 2]</b> filters the rows based on the condition <b>'population' > 2</b>.</p>
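
The row filter and the column selection can also be written as a single <b>loc</b> call, which avoids indexing twice. A minimal sketch, assuming an illustrative DataFrame <b>df</b> (not part of this notebook); the <b>~</b> operator, shown last, inverts a boolean mask.

```python
import pandas as pd

# Illustrative DataFrame.
df = pd.DataFrame(
    {'state': ['Ohio', 'Ohio', 'Nevada'],
     'year': ['2000', '2002', '2001'],
     'population': [1.5, 3.6, 2.4]},
    index=['A', 'B', 'C'],
)

# Boolean mask: True where population is greater than 2.
mask = df['population'] > 2

# Rows where the mask is True, restricted to two columns, in one step.
selected = df.loc[mask, ['state', 'year']]

# ~ negates the mask, selecting the remaining rows.
rest = df.loc[~mask, ['state', 'year']]

print(selected)
print(rest)
```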

<b>Data Aggregation</b>

In [53]:
data = {
        'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
        'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
        'Sales':[200., 120., 340., 124., 243., 350.]
}

In [54]:
data

{'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
 'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
 'Sales': [200.0, 120.0, 340.0, 124.0, 243.0, 350.0]}

In [55]:
frame4 = pd.DataFrame(data)

In [56]:
frame4

Unnamed: 0,Company,Person,Sales
0,GOOG,Sam,200.0
1,GOOG,Charlie,120.0
2,MSFT,Amy,340.0
3,MSFT,Vanessa,124.0
4,FB,Carl,243.0
5,FB,Sarah,350.0


In [57]:
by_company = frame4.groupby('Company')

Groups the data in <b>'frame4'</b> by the 'Company' column, creating a GroupBy object.

In [58]:
by_company.mean(numeric_only=True)

Company,Sales
FB,296.5
GOOG,160.0
MSFT,232.0


<b>by_company.mean()</b> calculates the mean of each numeric column for each group and returns a new DataFrame with the averaged results. The non-numeric <b>'Person'</b> column has no mean, so it does not appear in the result.

In [59]:
by_company.sum().loc['FB']

Sales    593.0
Name: FB, dtype: float64

To retrieve the sum of sales for a specific company (e.g., 'FB') after grouping by the <b>'Company'</b> column, you can use the <b>sum()</b> method on the GroupBy object and then use <b>.loc</b> to select the specific company's row.

In [60]:
by_company.count()

Company,Person,Sales
FB,2,2
GOOG,2,2
MSFT,2,2


This method is used to count the number of <b>non-NA/null</b> entries for each column in each group. <br>When you use it on a GroupBy object, it will provide a <b>count of the non-null values</b> for each column in each group.

In [61]:
by_company.min()

Company,Person,Sales
FB,Carl,243.0
GOOG,Charlie,120.0
MSFT,Amy,124.0


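<b>min()</b> computes the <b>minimum value</b> of each column within each group. Each column is handled independently, so the values in a row need not come from the same original record: for MSFT, 'Amy' is the alphabetically smallest name while 124.0 is Vanessa's sale.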

In [62]:
by_company.max()

Company,Person,Sales
FB,Sarah,350.0
GOOG,Sam,200.0
MSFT,Vanessa,340.0


<b>max()</b> computes the <b>maximum value</b> of each column within each group, with each column evaluated independently of the others.
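
When the goal is the whole record with the highest sale in each company, rather than per-column maxima, one possible approach is <b>idxmax()</b>, which returns the row label of the maximum in each group. The sketch below rebuilds the sales data under the illustrative name <b>sales</b>.

```python
import pandas as pd

# Same sales data as above, under an illustrative name.
sales = pd.DataFrame({
    'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
    'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
    'Sales': [200., 120., 340., 124., 243., 350.],
})

# Row label of the largest 'Sales' value within each company.
top_labels = sales.groupby('Company')['Sales'].idxmax()

# Use those labels to pull the complete rows.
top_rows = sales.loc[top_labels]

print(top_rows)
```

Here each returned row keeps the person and the sale together, so MSFT is paired with Amy and 340.0.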

In [63]:
by_company.describe()

Company,Sales count,Sales mean,Sales std,Sales min,Sales 25%,Sales 50%,Sales 75%,Sales max
FB,2.0,296.5,75.660426,243.0,269.75,296.5,323.25,350.0
GOOG,2.0,160.0,56.568542,120.0,140.0,160.0,180.0,200.0
MSFT,2.0,232.0,152.735065,124.0,178.0,232.0,286.0,340.0


This method in pandas provides a summary of statistics for each column in each group, including <b>count</b>, <b>mean</b>, <b>standard deviation</b>, <b>min</b>, <b>25th percentile (Q1)</b>, <b>median (50th percentile)</b>, <b>75th percentile (Q3)</b>, and <b>max</b>. This method is useful for getting a quick overview of the distribution and spread of data within each group.
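
When only a few of these statistics are needed, <b>agg()</b> can compute a chosen set in one call. A minimal sketch, again rebuilding the sales data under the illustrative name <b>sales</b>:

```python
import pandas as pd

# Same sales data as above, under an illustrative name.
sales = pd.DataFrame({
    'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
    'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
    'Sales': [200., 120., 340., 124., 243., 350.],
})

# Select the numeric column first, then request several aggregations at once.
summary = sales.groupby('Company')['Sales'].agg(['mean', 'sum', 'max'])

print(summary)
```

Selecting <b>'Sales'</b> before aggregating keeps the non-numeric <b>'Person'</b> column out of the calculation.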

<b>Merging</b> 

In [64]:
student_data = {'id': ['1001', '1002', '1003', '1004', '1005'],
                'name': ['Louisa Decker', 'Sioned Allan', 'Cayden Collier', 'Jerome Rudd', 'Jibril Morrow'],
                'section': ['3A', '3A', '3B', '3D', '3E']
               }

In [65]:
grades_data = {
    'id': ['1001', '1002', '1003', '1004', '1005'],
    'math': [91, 85, 86, 83, 75],
    'filipino': [88, 77, 89, 93, 80],
    'science': [92, 87, 85, 82, 78],
    'english': [90, 86, 88, 89, 80]
}

In [66]:
student = pd.DataFrame(student_data)

In [67]:
grade = pd.DataFrame(grades_data)

Combining two DataFrames, <b>student</b> and <b>grade</b>, on their shared <b>id</b> column is a common operation in pandas. With <b>how='inner'</b>, only ids present in both DataFrames are kept.

In [68]:
student_grades = pd.merge(student,grade,how='inner',on='id')

In [69]:
student_grades

Unnamed: 0,id,name,section,math,filipino,science,english
0,1001,Louisa Decker,3A,91,88,92,90
1,1002,Sioned Allan,3A,85,77,87,86
2,1003,Cayden Collier,3B,86,89,85,88
3,1004,Jerome Rudd,3D,83,93,82,89
4,1005,Jibril Morrow,3E,75,80,78,80
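
In the merge above every id appears in both DataFrames, so the choice of <b>how</b> makes no visible difference. The sketch below assumes two small illustrative DataFrames, <b>students</b> and <b>grades</b> (hypothetical data, not part of this notebook), in which one student has no grade, to show how an inner and a left merge differ. The <b>indicator=True</b> argument adds a <b>_merge</b> column recording where each row came from.

```python
import pandas as pd

# Hypothetical data: student 1002 has no entry in grades.
students = pd.DataFrame({'id': ['1001', '1002', '1003'],
                         'name': ['Ann', 'Ben', 'Cy']})
grades = pd.DataFrame({'id': ['1001', '1003'],
                       'math': [91, 86]})

# Inner merge: only ids found in both DataFrames.
inner = pd.merge(students, grades, how='inner', on='id')

# Left merge: every student is kept; missing grades become NaN.
left = pd.merge(students, grades, how='left', on='id', indicator=True)

print(inner)
print(left)
```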


In [70]:
student_df = pd.DataFrame(student_grades)

In [71]:
student_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        5 non-null      object
 1   name      5 non-null      object
 2   section   5 non-null      object
 3   math      5 non-null      int64 
 4   filipino  5 non-null      int64 
 5   science   5 non-null      int64 
 6   english   5 non-null      int64 
dtypes: int64(4), object(3)
memory usage: 320.0+ bytes


Calculate the <b>total scores</b> for each student by summing their scores in specific subjects (math, Filipino, science, and English).

In [106]:
total_scores = student_df[['math', 'filipino', 'science', 'english']].sum(axis=1)

Calculate the <b>GPA</b> based on the total scores you obtained earlier by normalizing the total scores against the maximum possible score and then rounding the result.

In [107]:
student_df['gpa'] = round((total_scores / 400) * 100)

In [108]:
student_df

Unnamed: 0,id,name,section,math,filipino,science,english,gpa,grade
0,1001,Louisa Decker,3A,91,88,92,90,90.0,A+
1,1002,Sioned Allan,3A,85,77,87,86,84.0,B
2,1003,Cayden Collier,3B,86,89,85,88,87.0,A
3,1004,Jerome Rudd,3D,83,93,82,89,87.0,A
4,1005,Jibril Morrow,3E,75,80,78,80,78.0,B+


<b>Function Application</b>

In [109]:
student_df['remarks'] = student_df['gpa'].apply(lambda grade: 'Passed' if grade > 74 else 'Failed')

Using <b>apply</b>: The apply method is called on the <b>'gpa'</b> column. This applies a function to each element in that column. <br>
<b>Lambda Function</b>: The lambda function checks whether the grade (GPA) is greater than 74. If it is, it returns 'Passed'; otherwise, it returns 'Failed'. <br>
The results of the classification are stored in a new column named <b>'remarks'</b>.
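
For a simple two-way condition like this one, a vectorized alternative is NumPy's <b>where</b>, which evaluates the whole column at once instead of calling a function per element. This is a sketch of that alternative, not the approach used in this notebook; the Series <b>gpa</b> holds illustrative values.

```python
import numpy as np
import pandas as pd

# Illustrative GPA values, including one exactly at the threshold.
gpa = pd.Series([90.0, 74.0, 78.0])

# Element-by-element version, as in the cell above.
remarks_apply = gpa.apply(lambda grade: 'Passed' if grade > 74 else 'Failed')

# Vectorized version: np.where picks 'Passed' where the condition is True.
remarks_where = pd.Series(np.where(gpa > 74, 'Passed', 'Failed'), index=gpa.index)

print(remarks_apply.tolist())
print(remarks_where.tolist())
```

Both versions treat a GPA of exactly 74 as 'Failed', because the comparison is strict.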

In [111]:
student_df

Unnamed: 0,id,name,section,math,filipino,science,english,gpa,grade,remarks
0,1001,Louisa Decker,3A,91,88,92,90,90.0,A+,Passed
1,1002,Sioned Allan,3A,85,77,87,86,84.0,B,Passed
2,1003,Cayden Collier,3B,86,89,85,88,87.0,A,Passed
3,1004,Jerome Rudd,3D,83,93,82,89,87.0,A,Passed
4,1005,Jibril Morrow,3E,75,80,78,80,78.0,B+,Passed


In [77]:
def getLetterGrade(grade):
    """Map a rounded numeric grade (0-100) to a letter grade."""
    if 90 <= grade <= 100:
        return "A+"
    elif 85 <= grade <= 89:
        return "A"
    elif 80 <= grade <= 84:
        return "B"  # same label as the 73-76 band below
    elif 77 <= grade <= 79:
        return "B+"
    elif 73 <= grade <= 76:
        return "B"
    elif 70 <= grade <= 72:
        return "B-"
    else:
        # Anything outside 70-100 falls through to this label.
        return 'C+ below'

In [78]:
student_df['grade'] = student_df['gpa'].apply(getLetterGrade)

In [79]:
student_df

Unnamed: 0,id,name,section,math,filipino,science,english,gpa,grade
0,1001,Louisa Decker,3A,91,88,92,90,90.0,A+
1,1002,Sioned Allan,3A,85,77,87,86,84.0,B
2,1003,Cayden Collier,3B,86,89,85,88,87.0,A
3,1004,Jerome Rudd,3D,83,93,82,89,87.0,A
4,1005,Jibril Morrow,3E,75,80,78,80,78.0,B+


<b>Filtering Out Missing Data</b>

In [80]:
data = pd.DataFrame([[1., 6.5, 3.5],[1., np.nan, np.nan]])

In [81]:
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.5
1,1.0,,


<b>dropna()</b> removes every row of a DataFrame that contains at least one missing value (NaN).

In [82]:
cleaned = data.dropna()

In [83]:
cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.5


Passing <b>how='all'</b> removes only the rows in which every value is missing (NaN).

In [84]:
data.dropna(how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.5
1,1.0,,


The <b>how</b> parameter controls this: if set to <b>'all'</b>, a row is dropped only if all the values in that row are NaN, so row 1 above is kept because it still contains 1.0.
If set to <b>'any'</b> (the default), a row is dropped if any of its values is NaN.
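
Two further parameters of <b>dropna</b> give finer control: <b>thresh</b> keeps rows with at least a given number of non-missing values, and <b>subset</b> restricts the check to certain columns. A minimal sketch, assuming an illustrative DataFrame <b>df</b> (not part of this notebook):

```python
import numpy as np
import pandas as pd

# Illustrative DataFrame with different amounts of missing data per row.
df = pd.DataFrame([
    [1.0, 6.5, 3.5],            # row 0: complete
    [1.0, np.nan, np.nan],      # row 1: one value present
    [np.nan, np.nan, np.nan],   # row 2: entirely missing
    [2.0, 7.0, np.nan],         # row 3: two values present
])

any_missing = df.dropna()             # drop rows with any NaN
all_missing = df.dropna(how='all')    # drop rows that are entirely NaN
at_least_two = df.dropna(thresh=2)    # keep rows with 2 or more non-NaN values
col1_present = df.dropna(subset=[1])  # only check column 1 for NaN

print(list(any_missing.index))
print(list(all_missing.index))
print(list(at_least_two.index))
print(list(col1_present.index))
```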

In [85]:
data_population = {
        'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year' : ['2000', '2001', '2002', '2001', '2002', '2003'],
        'population' : [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]
       }

In [86]:
data_state = pd.DataFrame(data_population, columns=['state', 'year', 'population', 'dept'], index=['A', 'B', 'C', 'D', 'E', 'F'])

In [87]:
data_state

Unnamed: 0,state,year,population,dept
A,Ohio,2000,1.5,
B,Ohio,2001,1.7,
C,Ohio,2002,3.6,
D,Nevada,2001,2.4,
E,Nevada,2002,2.9,
F,Nevada,2003,3.2,


Passing <b>axis=1</b> together with <b>how='all'</b> removes the columns that contain only missing values (NaNs), here the <b>'dept'</b> column.

In [88]:
data_state.dropna(axis=1, how='all')

Unnamed: 0,state,year,population
A,Ohio,2000,1.5
B,Ohio,2001,1.7
C,Ohio,2002,3.6
D,Nevada,2001,2.4
E,Nevada,2002,2.9
F,Nevada,2003,3.2


<b>Removing Duplicates</b>

In [89]:
data_dup = pd.DataFrame({
        'k1': ['one', 'two'] * 3 + ['two'], 
        'k2': [1, 1, 2, 3, 3, 4, 4]
})

In [90]:
data_dup

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


In [91]:
data_dup.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [92]:
data_dup.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
