Why do I need to learn basic machine learning libraries like NumPy and Pandas for NLP?

Libraries like NumPy and Pandas matter for NLP because text has to be turned into numbers and tables before a model can work with it.

NumPy stores those numbers in fast arrays, and Pandas organizes text and labels into tables that are easy to inspect, clean, and filter. It's like arranging words neatly in a list or table.


## NumPy

NumPy is a fundamental library in Python for numerical and scientific computing that is widely used in machine learning and data science. It provides support for multi-dimensional arrays (often referred to as ndarrays) and a wide range of mathematical functions to operate on these arrays efficiently. NumPy is the foundation of many scientific and data science libraries in Python.

Install NumPy by running `pip install numpy` in your terminal or command prompt. Inside a notebook, prefix the command with `!` (uncomment the line below):

In [None]:
# !pip install numpy

Once you've installed NumPy you can import it as a library:

In [None]:
import numpy as np

Arrays: The primary data structure in NumPy is the ndarray (n-dimensional array). These arrays can be one-dimensional (vectors), two-dimensional (matrices), or multi-dimensional (tensors).

In [None]:
# Creating a 1D array
arr1d = np.array([1, 2, 3, 4, 5])

# Creating a 2D array (matrix)
arr2d = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])

print(arr1d)
print('\n')
print(arr2d)

[1 2 3 4 5]


[[1 2 3]
 [4 5 6]
 [7 8 9]]
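To check how many dimensions an array has, the `ndim` and `shape` attributes are handy. The following is a small sketch; `arr3d` is an illustrative name that does not appear elsewhere in this notebook.

```python
import numpy as np

arr1d = np.array([1, 2, 3, 4, 5])
arr2d = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])
# A 3D array (illustrative): 2 blocks, each with 2 rows and 3 columns
arr3d = np.zeros((2, 2, 3))

for name, a in [('arr1d', arr1d), ('arr2d', arr2d), ('arr3d', arr3d)]:
    # ndim is the number of axes, shape is the size along each axis
    print(name, a.ndim, a.shape)
```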


### Built-in Methods
There are several built-in functions for generating arrays:

arange

Return evenly spaced values within a given interval. The start value is included and the stop value is excluded.


In [None]:
np.arange(0,10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [None]:
np.arange(0,11,2)

array([ 0,  2,  4,  6,  8, 10])

zeros and ones

Generate arrays filled with zeros or ones. Pass a single number for a 1D array, or a tuple for a multi-dimensional shape.


In [None]:
np.zeros(3)

array([0., 0., 0.])

In [None]:
np.zeros((5,5))

array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])
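`np.ones` works the same way as `np.zeros`. A minimal sketch (the variable names are illustrative):

```python
import numpy as np

ones_1d = np.ones(3)        # three 1.0 values
ones_2d = np.ones((2, 4))   # 2 rows, 4 columns, all 1.0

print(ones_1d)
print(ones_2d)
```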

linspace

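Return evenly spaced numbers over a specified interval. The third argument is the number of points to generate, and both endpoints are included by default.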


In [None]:
np.linspace(0,10,3)

array([ 0.,  5., 10.])

In [None]:
np.linspace(0,10,50)

array([ 0.        ,  0.20408163,  0.40816327,  0.6122449 ,  0.81632653,
        1.02040816,  1.2244898 ,  1.42857143,  1.63265306,  1.83673469,
        2.04081633,  2.24489796,  2.44897959,  2.65306122,  2.85714286,
        3.06122449,  3.26530612,  3.46938776,  3.67346939,  3.87755102,
        4.08163265,  4.28571429,  4.48979592,  4.69387755,  4.89795918,
        5.10204082,  5.30612245,  5.51020408,  5.71428571,  5.91836735,
        6.12244898,  6.32653061,  6.53061224,  6.73469388,  6.93877551,
        7.14285714,  7.34693878,  7.55102041,  7.75510204,  7.95918367,
        8.16326531,  8.36734694,  8.57142857,  8.7755102 ,  8.97959184,
        9.18367347,  9.3877551 ,  9.59183673,  9.79591837, 10.        ])
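The difference between `arange` and `linspace` is easiest to see side by side: `arange` takes a step size and leaves out the stop value, while `linspace` takes a number of points and includes it. A short sketch with illustrative variable names:

```python
import numpy as np

by_step = np.arange(0, 10, 2.5)    # step of 2.5; the stop value 10 is excluded
by_count = np.linspace(0, 10, 5)   # 5 points; the stop value 10 is included

print(by_step)    # values: 0, 2.5, 5, 7.5
print(by_count)   # values: 0, 2.5, 5, 7.5, 10
```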

### NumPy Indexing and Selection

In [None]:
#Creating sample array
arr = np.arange(0,11)

In [None]:
#Show
arr

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

Bracket Indexing and Selection

The simplest way to pick one or more elements of an array looks very similar to indexing Python lists:


In [None]:
#Get a value at an index
arr[8]

8

In [None]:
#Get values in a range
arr[1:5]

array([1, 2, 3, 4])

In [None]:
#Get values in a range
arr[0:5]

array([0, 1, 2, 3, 4])

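Broadcasting

Broadcasting is NumPy's ability to apply an operation between arrays of different shapes, or between an array and a single value, by stretching the smaller operand to match the larger one.

NumPy arrays differ from normal Python lists in this respect. For example, a single value assigned to a slice is broadcast to every position in that slice: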


In [None]:
#Setting a value with index range (Broadcasting)
arr[0:5]=100

#Show
arr

array([100, 100, 100, 100, 100,   5,   6,   7,   8,   9,  10])
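Broadcasting also applies between arrays of different shapes, and it is worth knowing that a slice is a view on the original array rather than a copy. The sketch below illustrates both points; the names `matrix`, `row`, `original`, `slice_view`, and `safe` are illustrative.

```python
import numpy as np

# Broadcasting between shapes: a (2, 3) matrix plus a (3,) row
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
row = np.array([10, 20, 30])
print(matrix + row)        # row is added to every row of matrix

# Slices are views: writing to the slice changes the original array
original = np.arange(0, 11)
slice_view = original[0:3]
slice_view[:] = 99
print(original[:4])        # the first three values are now 99

# Use .copy() when the original must stay unchanged
original = np.arange(0, 11)
safe = original[0:3].copy()
safe[:] = 99
print(original[:4])        # unchanged: 0, 1, 2, 3
```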

### NumPy Operations

Arithmetic

You can easily perform array-with-array arithmetic or scalar-with-array arithmetic. In both cases the operation is applied element by element.

In [None]:
import numpy as np
arr = np.arange(0,10)
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [None]:
arr + arr

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [None]:
arr * arr

array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])

In [None]:
arr - arr

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
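Scalar-with-array arithmetic follows the same element-by-element pattern. A minimal sketch:

```python
import numpy as np

arr = np.arange(0, 10)

print(arr + 1)        # add 1 to every element
print(arr * 2)        # double every element
print(arr ** 2)       # square every element
print(np.sqrt(arr))   # NumPy functions such as sqrt also work element-wise
```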

## Pandas

Pandas is a widely-used Python library for data manipulation and analysis. It provides easy-to-use data structures and functions for working with structured data, making it an essential tool for data scientists, analysts, and developers. The two primary data structures in Pandas are:

DataFrame: Think of it as a table or spreadsheet with rows and columns, similar to a database table or an Excel sheet. It allows you to store and manipulate data in a tabular format, where each column can have a different data type.

Series: A Series is like a single column or a one-dimensional array, containing data and an associated label, called an index. Series can be thought of as the building blocks of a DataFrame.

Install pandas by running `pip install pandas` in your terminal or command prompt. Inside a notebook, prefix the command with `!`:

In [None]:
!pip install pandas



Once you've installed pandas you can import it as a library:

In [None]:
import pandas as pd

### Data Loading

Pandas can read data from various sources, including CSV files, Excel spreadsheets, databases, and web APIs. You can load data into DataFrames for analysis.


In [None]:
# Create a DataFrame from a CSV file
df = pd.read_csv('/content/drive/MyDrive/NLP WORK/The Art of Natural language processing/7. Machine Learning Basics /data.csv')
df

Unnamed: 0,Footnotes,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6
0,1,The Business Operations Survey samples busines...,BOS Series Data Collection ̶ DataInfo+,,,,
1,2,For more information about the number of busin...,New Zealand business demography,,,,
2,3,"See DataInfo+ for questionnaires, and more exp...",DataInfo+,,,,
3,4,Business size is defined by rolling mean emplo...,BOS Concepts ̶ DataInfo+,,,,
4,5,"Results for the mining and electricity, gas, w...",,,,,
5,6,Counts of businesses are randomly rounded to b...,,,,,
6,7,"Dollar values are rounded to nearest $1,000. A...",,,,,
7,8,Percentages are of total number of businesses ...,,,,,
8,9,Percentages are of total number of businesses ...,,,,,
9,10,Counts of employees are graduated randomly rou...,,,,,


In [None]:
# Load data from an Excel file into a DataFrame
df = pd.read_excel('/content/drive/MyDrive/NLP WORK/The Art of Natural language processing/7. Machine Learning Basics /data.xlsx', sheet_name='Sheet1')
df

Unnamed: 0,Footnotes,Unnamed: 1,Unnamed: 2
0,1,The Business Operations Survey samples busines...,BOS Series Data Collection ̶ DataInfo+
1,2,For more information about the number of busin...,New Zealand business demography
2,3,"See DataInfo+ for questionnaires, and more exp...",DataInfo+
3,4,Business size is defined by rolling mean emplo...,BOS Concepts ̶ DataInfo+
4,5,"Results for the mining and electricity, gas, w...",
5,6,Counts of businesses are randomly rounded to b...,
6,7,"Dollar values are rounded to nearest $1,000. A...",
7,8,Percentages are of total number of businesses ...,
8,9,Percentages are of total number of businesses ...,
9,10,Counts of employees are graduated randomly rou...,


In [None]:
# Load data from a JSON file (typ='series' returns a Series, not a DataFrame)
series = pd.read_json('/content/drive/MyDrive/NLP WORK/The Art of Natural language processing/7. Machine Learning Basics /data.json', typ='series')
series

fruit    Apple
size     Large
color      Red
dtype: object
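The loading cells above depend on files stored in a particular Drive folder. To try `read_csv` without any file, one option is to read from an in-memory string. This is only a sketch: the CSV content and the name `reviews` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """text,label
I loved this movie,positive
Terrible plot,negative
"""

# read_csv accepts any file-like object, not only a path
reviews = pd.read_csv(io.StringIO(csv_text))
print(reviews.shape)              # (rows, columns)
print(reviews.columns.tolist())   # column names taken from the header row
```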

Creating a DataFrame:

In [None]:
import pandas as pd

# Create a DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35]}
df = pd.DataFrame(data)
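
A Series, the single-column building block described earlier, can be created in a similar way. The sketch below uses made-up word counts and token data as a small NLP-flavored illustration; `word_counts` and `tokens` are hypothetical names.

```python
import pandas as pd

# A Series: one column of values with an index (here, hypothetical word counts)
word_counts = pd.Series([3, 1, 2], index=['the', 'cat', 'sat'], name='count')

# A DataFrame: several columns that share the same row index
tokens = pd.DataFrame({'token': ['the', 'cat', 'sat'],
                       'length': [3, 3, 3],
                       'is_stopword': [True, False, False]})

print(word_counts['the'])       # look up a value by its index label
print(type(tokens['token']))    # each DataFrame column is a Series
```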


Basic DataFrame Operations:

In [None]:
# Display the first few rows of the DataFrame
print(df.head())

# Display information about the DataFrame
# (info() prints its report and returns None, so print() also shows "None")
print(df.info())

# Get summary statistics of the numeric columns
print(df.describe())

# Select a single column
print(df['Name'])

# Select multiple columns
print(df[['Name', 'Age']])

# Filter rows based on a condition
print(df[df['Age'] > 30])


      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    3 non-null      object
 1   Age     3 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 176.0+ bytes
None
        Age
count   3.0
mean   30.0
std     5.0
min    25.0
25%    27.5
50%    30.0
75%    32.5
max    35.0
0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
      Name  Age
2  Charlie   35


### Data Cleaning

Pandas provides functions to clean and preprocess data by handling missing values, duplicates, and outliers. This is crucial for preparing data for analysis.

Handling Missing Values:

Missing values are common in real-world datasets. Pandas provides functions to detect and handle them.

In [None]:
import pandas as pd

# Create a DataFrame with missing values
data = {'A': [1, 2, None, 4, 5],
        'B': [None, 2, 3, 4, None]}
df = pd.DataFrame(data)

# Check for missing values
print(df.isnull())

# Drop rows with any missing values
df_cleaned = df.dropna()

# Fill missing values with a specific value
df_filled = df.fillna(0)


       A      B
0  False   True
1  False  False
2   True  False
3  False  False
4  False   True
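
The cell above creates `df_cleaned` and `df_filled` without displaying them. The following sketch shows what each approach produces on the same data, plus two common variations: counting missing values per column and filling with the column mean.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, None, 4, 5],
                   'B': [None, 2, 3, 4, None]})

print(df.isnull().sum())       # missing values per column
print(df.dropna())             # keeps only rows with no missing values
print(df.fillna(0))            # every NaN replaced by 0
print(df.fillna(df.mean()))    # every NaN replaced by its column mean
```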


Removing Duplicates:


Duplicate rows can distort analysis results. You can identify and remove duplicates in a DataFrame.

In [None]:
import pandas as pd

# Create a DataFrame with duplicate rows
data = {'A': [1, 2, 2, 3, 4],
        'B': ['X', 'Y', 'Y', 'Z', 'X']}
df = pd.DataFrame(data)

# Check for duplicates
print(df.duplicated())

# Remove duplicates
df_cleaned = df.drop_duplicates()


0    False
1    False
2     True
3    False
4    False
dtype: bool
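
By default a row counts as a duplicate only when every column matches an earlier row. The `subset` and `keep` parameters of `drop_duplicates` change that behavior, as this sketch shows (the names `by_b` and `last_b` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 4],
                   'B': ['X', 'Y', 'Y', 'Z', 'X']})

# Full-row duplicates: row 2 repeats row 1, so 4 rows remain
print(len(df.drop_duplicates()))

# Compare only column 'B': rows 2 and 4 repeat earlier values
by_b = df.drop_duplicates(subset='B')
print(by_b.index.tolist())

# keep='last' retains the final occurrence instead of the first
last_b = df.drop_duplicates(subset='B', keep='last')
print(last_b.index.tolist())
```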


Data Type Conversion:

Converting data types is crucial for proper analysis. You can change the data type of a column in Pandas.

In [None]:
import pandas as pd

# Create a DataFrame with a column of strings
data = {'A': ['1', '2', '3']}
df = pd.DataFrame(data)

# Convert the 'A' column to integers
df['A'] = df['A'].astype(int)
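
`astype(int)` raises an error if any value cannot be parsed as an integer. For messier columns, `pd.to_numeric` with `errors='coerce'` is one alternative. The data below is hypothetical:

```python
import pandas as pd

# Hypothetical messy column: one entry is not a number
raw = pd.Series(['1', '2', 'three'])

# errors='coerce' turns unparseable entries into NaN instead of raising
numbers = pd.to_numeric(raw, errors='coerce')
print(numbers)
```

Because NaN is a floating-point value, the resulting Series has a float dtype rather than an integer one.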


Renaming Columns:

You can rename columns to make them more informative or consistent.

In [None]:
import pandas as pd

# Create a DataFrame with columns to be renamed
data = {'old_name1': [1, 2, 3],
        'old_name2': ['X', 'Y', 'Z']}
df = pd.DataFrame(data)

# Rename columns
df.rename(columns={'old_name1': 'new_name1', 'old_name2': 'new_name2'}, inplace=True)


Removing Irrelevant Columns:

Sometimes, you may need to drop columns that are not relevant to your analysis.

In [None]:
import pandas as pd

# Create a DataFrame with columns to be dropped
data = {'A': [1, 2, 3],
        'B': ['X', 'Y', 'Z'],
        'C': [10, 20, 30]}
df = pd.DataFrame(data)

# Drop columns 'B' and 'C'
df_cleaned = df.drop(['B', 'C'], axis=1)


These are some common data cleaning tasks in Pandas, but data cleaning can vary significantly depending on the dataset and specific analysis goals. Pandas offers a wide range of functions to handle various data cleaning and preparation tasks effectively.

### Data Indexing and Selection

You can use labels or integer-based indexing to select specific rows and columns in DataFrames or Series.

Selecting Columns:

You can select one or more columns from a DataFrame using square brackets or dot notation.

In [None]:
import pandas as pd

# Create a DataFrame
data = {'A': [1, 2, 3],
        'B': ['X', 'Y', 'Z']}
df = pd.DataFrame(data)

# Using square brackets
column_A = df['A']

# Using dot notation (only for names that are valid Python identifiers
# and do not clash with an existing DataFrame attribute or method)
column_B = df.B

# Selecting multiple columns
subset = df[['A', 'B']]


Selecting Rows by Index:

You can use .loc[] or .iloc[] to select rows by their index label or integer position, respectively.


In [None]:
import pandas as pd

# Create a DataFrame with custom index labels
data = {'A': [1, 2, 3],
        'B': ['X', 'Y', 'Z']}
df = pd.DataFrame(data, index=['row1', 'row2', 'row3'])

# Selecting a single row by label
row = df.loc['row2']

# Selecting a single row by integer position
row = df.iloc[1]

# Selecting multiple rows
subset = df.loc[['row1', 'row3']]


Selecting Rows by Condition:

You can select rows based on specific conditions using boolean indexing.

In [None]:
import pandas as pd

# Create a DataFrame
data = {'A': [1, 2, 3],
        'B': ['X', 'Y', 'Z']}
df = pd.DataFrame(data)

# Select rows where column 'A' is greater than 1
subset = df[df['A'] > 1]
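
Several conditions can be combined with `&` (and) and `|` (or), as long as each condition is wrapped in parentheses. A short sketch on the same data, with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': ['X', 'Y', 'Z']})

# AND: wrap each condition in parentheses and join with &
both = df[(df['A'] > 1) & (df['B'] != 'Z')]

# OR: join with |
either = df[(df['A'] == 1) | (df['B'] == 'Z')]

# Membership test with isin()
in_set = df[df['B'].isin(['X', 'Y'])]

print(both.index.tolist(), either.index.tolist(), in_set.index.tolist())
```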


Combining Row and Column Selection:

You can combine row and column selection using .loc[] or .iloc[].

In [None]:
import pandas as pd

# Create a DataFrame
data = {'A': [1, 2, 3],
        'B': ['X', 'Y', 'Z']}
df = pd.DataFrame(data)

# Select a specific cell value
cell_value = df.loc[1, 'A']

# Select a subset of rows and columns
# (label slices with .loc include the end label, so 0:1 returns rows 0 and 1)
subset = df.loc[0:1, ['A', 'B']]
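
One detail that often causes confusion: slices with `.loc[]` include the end label, while slices with `.iloc[]` exclude the end position, as in ordinary Python slicing. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': ['X', 'Y', 'Z']})

by_label = df.loc[0:1]       # labels 0 through 1, end included
by_position = df.iloc[0:1]   # positions 0 up to (not including) 1

print(len(by_label), len(by_position))

# iloc also selects columns by position: row 1, column 0
print(df.iloc[1, 0])
```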


Setting Values:

You can also set values for specific cells, rows, or columns in a DataFrame.

In [None]:
import pandas as pd

# Create a DataFrame
data = {'A': [1, 2, 3],
        'B': ['X', 'Y', 'Z']}
df = pd.DataFrame(data)

# Set a value for a specific cell
df.loc[1, 'A'] = 10

# Set values for an entire column
df['B'] = ['M', 'N', 'O']
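
Values can also be set only where a condition holds, and new columns can be computed from existing ones. The sketch below assumes the same starting data; the value `'big'` and the column name `A_doubled` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': ['X', 'Y', 'Z']})

# Set 'B' to 'big' only in rows where 'A' is greater than 1
df.loc[df['A'] > 1, 'B'] = 'big'

# Add a new column computed from an existing one
df['A_doubled'] = df['A'] * 2
print(df)
```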


### Data Manipulation

Pandas offers powerful tools for data manipulation, such as filtering, sorting, grouping, and aggregating data. It allows you to transform data easily.

Filtering Data:

Filtering allows you to select a subset of rows based on specific conditions.

In [None]:
import pandas as pd

# Create a DataFrame
data = {'A': [1, 2, 3, 4, 5],
        'B': ['X', 'Y', 'Z', 'X', 'Y']}
df = pd.DataFrame(data)

# Filter rows where column 'A' is greater than 3
filtered_df = df[df['A'] > 3]
filtered_df

   A  B
3  4  X
4  5  Y


Sorting Data:

Sorting rearranges the rows of your DataFrame based on column values.

In [None]:
# Sort DataFrame by column 'A' in ascending order
sorted_df = df.sort_values(by='A')

# Sort DataFrame by multiple columns
sorted_df = df.sort_values(by=['A', 'B'])
sorted_df

   A  B
0  1  X
1  2  Y
2  3  Z
3  4  X
4  5  Y


Grouping and Aggregation:

Grouping allows you to group rows by a specific column or columns and apply aggregation functions to calculate summary statistics.

In [None]:
# Group by column 'B' and calculate the mean of column 'A' for each group
grouped_df = df.groupby('B')['A'].mean()
grouped_df

B
X    2.5
Y    3.5
Z    3.0
Name: A, dtype: float64
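
More than one aggregation can be computed in a single call with `agg`. A small sketch on the same data (`summary` is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': ['X', 'Y', 'Z', 'X', 'Y']})

# Several aggregations at once, one output column per function
summary = df.groupby('B')['A'].agg(['count', 'sum', 'mean'])
print(summary)
```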

Pivot Tables:

Pivot tables allow you to summarize and analyze data in a structured format.

Creating a Pivot Table

In [None]:
data = {
    'Date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02', '2024-01-03'],
    'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'New York'],
    'Temperature': [32, 75, 30, 77, 28],
    'Humidity': [80, 60, 85, 50, 75]
}

df = pd.DataFrame(data)
df

         Date         City  Temperature  Humidity
0  2024-01-01     New York           32        80
1  2024-01-01  Los Angeles           75        60
2  2024-01-02     New York           30        85
3  2024-01-02  Los Angeles           77        50
4  2024-01-03     New York           28        75


In [None]:
# Create a pivot table with pivot_table(), specifying the index, columns, and values.
# When no aggfunc is given, values that share an index/column pair are averaged.

pivot_table = df.pivot_table(index='Date', columns='City', values=['Temperature', 'Humidity'])
pivot_table

              Humidity          Temperature
City       Los Angeles New York Los Angeles New York
Date
2024-01-01        60.0     80.0        75.0     32.0
2024-01-02        50.0     85.0        77.0     30.0
2024-01-03         NaN     75.0         NaN     28.0


This will create a pivot table with the 'Date' as index, 'City' as columns, and 'Temperature' and 'Humidity' as values.

Customizing the Pivot Table

You can customize the pivot table by choosing an aggregation function, selecting which value columns to include, and filling in missing values.

In [None]:
pivot_table = df.pivot_table(index='Date', columns='City', values='Temperature', aggfunc='mean', fill_value=0)
pivot_table

City        Los Angeles  New York
Date
2024-01-01           75        32
2024-01-02           77        30
2024-01-03            0        28


In this example, the pivot table will show the mean temperature for each date and city, and any missing values will be filled with 0.



Resetting Index

If you want to convert the index of the pivot table back into a regular column, you can use the reset_index() method.

In [None]:
# inplace=True modifies pivot_table itself instead of returning a new DataFrame
pivot_table.reset_index(inplace=True)
pivot_table

City        Date  Los Angeles  New York
0     2024-01-01           75        32
1     2024-01-02           77        30
2     2024-01-03            0        28


This will convert the 'Date' index back to a regular column.

Merging and Joining DataFrames:

Merging combines two DataFrames based on common columns. With how='inner', only keys that appear in both DataFrames are kept.

In [None]:
# Create two DataFrames
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'C', 'D'], 'value2': [4, 5, 6]})

# Merge DataFrames on the 'key' column
merged_df = pd.merge(df1, df2, on='key', how='inner')
merged_df

  key  value1  value2
0   B       2       4
1   C       3       5
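
Other values of `how` keep rows that have no match on the other side and fill the gaps with NaN. A short sketch using the same two DataFrames (`left` and `outer` are illustrative names):

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'C', 'D'], 'value2': [4, 5, 6]})

left = pd.merge(df1, df2, on='key', how='left')     # all keys from df1
outer = pd.merge(df1, df2, on='key', how='outer')   # keys from either DataFrame

print(left['key'].tolist())
print(sorted(outer['key'].tolist()))
print(outer.isna().sum().sum())   # number of cells left as NaN
```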
