# Python for Engineers (Part II: External Libraries and Data Analysis)

Major topics that we will cover in this notebook are:

- Modules and external libraries
- Importing tabular data via pandas
- Data analysis using pandas
- Data visualization using Matplotlib
- Supplementary information about the NumPy library



### 1) Modules
In Python, a module is simply a file that contains Python code, typically functions, classes, and variables, which you can reuse in other programs. 

**Modules** help organize code, avoid repetition, and keep it easy to maintain. You can create your own module by saving a .py file with your code or use one of Python’s many built-in modules, like *math*, *datetime*, or *random*, which provide various functionalities.

#### Importing Modules
To use a module in your code, you import it using the ***import*** statement. For example:

In [1]:
# Your code goes here

Here we import the *math* module and then use the ***math.sqrt()*** function to calculate the square root of 16.
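As a minimal sketch of the import described here (the exact cell contents are an assumption), the following imports `math` and calls `math.sqrt()` on 16:

```python
import math

# Functions inside a module are reached through the module name
result = math.sqrt(16)
print(result)  # 4.0
```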

In [2]:
# Your code goes here

#### Importing Specific Functions
If you only need specific functions from a module, you can import them directly:

In [3]:
# Your code goes here

This allows you to use pi and sqrt() directly without referencing math.
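A short sketch of this style of import, assuming we only need `pi` and `sqrt` from `math` (the radius and the number 25 are arbitrary illustration values):

```python
from math import pi, sqrt

# Both names are now available without the "math." prefix
area = pi * 2 ** 2   # area of a circle with radius 2
root = sqrt(25)
print(area, root)
```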

You can also give a library or module a short alias when importing it, which keeps programs shorter.
Use **import ... as ...** to assign the alias while importing, then refer to items in the library through that shortened name.
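For instance, a minimal sketch using the hypothetical alias `m` for the `math` module (any valid name would work as the alias):

```python
import math as m

# The alias replaces the full module name in every call
hypotenuse = m.sqrt(3 ** 2 + 4 ** 2)
print(hypotenuse)  # 5.0
```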


In [4]:
# Your code goes here

### 2) Introduction to External Libraries

Most of the power of Python is in its external libraries.  In Python, a library is a collection of modules that provide reusable code to help programmers perform various tasks without having to write everything from scratch. Libraries enhance productivity by allowing developers to leverage existing functionality for common tasks, promoting efficiency and reducing redundancy in code. 

A library is a collection of modules, but the terms are often used interchangeably, especially since many libraries only consist of a single module, so don’t worry if you mix them.

### Examples of Popular Libraries

- **Pandas:** This library is invaluable for data analysis and manipulation. It allows users to work with structured data in the form of DataFrames and Series, making it ideal for handling large datasets.

- **Matplotlib:** A powerful library for data visualization, Matplotlib enables users to create static, interactive, and animated visualizations to effectively present data insights.

- **NumPy:** A fundamental library for numerical computing, NumPy provides support for arrays and matrices, along with a wide range of mathematical functions.

- **SciPy:** SciPy is used for scientific and technical computing, providing a wide range of high-level commands for tasks such as optimization, integration, signal processing, and statistics.

- **Scikit-learn:** Scikit-learn is used for various traditional machine learning tasks, including classification, regression, clustering, and dimensionality reduction.

- **PyTorch:** PyTorch offers a GPU-accelerated tensor library for numerical computations and is used for developing and training deep learning models in applications such as computer vision, natural language processing, and robotics.

Libraries are important tools in Python that help programmers work more efficiently. They provide ready-made code that can be reused, making it easier to complete common tasks without starting from scratch. By using libraries, developers can solve complicated problems more easily and concentrate on the bigger picture of their projects. 

### 3) Importing tabular data via Pandas library

Pandas is an open-source Python library offering tools and functions for both data manipulation and analysis. It is widely used for handling structured data, such as the tabular data found in spreadsheets and databases, and it supports various formats, including CSV, XLSX, JSON, and more. Pandas represents tabular data with the DataFrame object type: a 2-dimensional table whose columns have names and can have different data types. We first need to load Pandas with **import pandas as pd**. The alias **pd** is commonly used to refer to the Pandas library in code.

In [5]:
# Your code goes here

The first thing we want to do is load our tabular data into Python. In this example, we use a comma-separated values (CSV) data file, which we can read with ***pd.read_csv()***. Its argument is the name and location of the file to be read, and it returns a dataframe that you can assign to a variable.


In [6]:
# Your code goes here

Here the CSV file is in a folder called "data", so you need to specify the location of the data file relative to the location of your Jupyter notebook on your computer. The columns in a dataframe are the observed variables, and the rows are the observations. Pandas uses a backslash \ to show wrapped lines when the output is too wide to fit the screen. Using descriptive dataframe names helps us distinguish between multiple dataframes so we won’t accidentally overwrite a dataframe or read from the wrong one.


We can use **index_col** to specify that a column’s values should be used as row headings. In this example, row headings are numbers (0 and 1), and we really want to index rows by country names instead of numbers. Therefore, we pass the name of the column to read_csv as its index_col parameter to do this.
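The effect of `index_col` can be sketched without a file on disk. In the example below, an in-memory string stands in for the CSV file; the column names and all numbers are made-up placeholders, not the real dataset.

```python
import io
import pandas as pd

# Placeholder CSV text standing in for a real file; the values are invented
csv_text = """country,gdpPercap_2002,gdpPercap_2007
Australia,30000.0,34000.0
New Zealand,23000.0,25000.0
"""

# Without index_col, rows are labelled 0, 1, ...
by_number = pd.read_csv(io.StringIO(csv_text))

# With index_col, the 'country' column becomes the row labels
by_country = pd.read_csv(io.StringIO(csv_text), index_col="country")

print(by_number.index.tolist())
print(by_country.index.tolist())
```

With a real file, the first argument would be its path instead of the `io.StringIO` object.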


In [7]:
# Your code goes here

Use the ***DataFrame.info()*** method to find out more about a dataframe:

In [8]:
# Your code goes here

According to this output, the object is a DataFrame with two rows, labelled 'Australia' and 'New Zealand', and twelve columns. Each column holds two non-null 64-bit floating point values, and the dataframe uses 208 bytes of memory.


The DataFrame.columns variable stores information about the dataframe’s columns. Note that this is data, not a method (it doesn’t have parentheses). We call this an attribute or a member variable.

In [9]:
# Your code goes here

We can use **DataFrame.T** to transpose a dataframe (switch columns with rows). Transpose doesn’t copy the data, just changes the program’s view of it.
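A small sketch of the `columns` attribute and the transpose, using a made-up two-row table (the column names and values are placeholders):

```python
import pandas as pd

# Small invented table for illustration
df = pd.DataFrame(
    {"gdp_2002": [30000.0, 23000.0], "gdp_2007": [34000.0, 25000.0]},
    index=["Australia", "New Zealand"],
)

print(df.columns)   # an attribute, so no parentheses
transposed = df.T   # rows become columns and vice versa
print(transposed)
```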

In [10]:
# Your code goes here

Finally, we can use **DataFrame.describe()** to get summary statistics about data:

In [11]:
# Your code goes here

### Exercise 9:

Read the data in gapminder_gdp_americas.csv (which should be in the same directory as gapminder_gdp_oceania.csv) into a variable called data_americas and display its summary statistics.

In [12]:
# Your code goes here

### Exercise 10: 



After reading the data for the Americas, use help(data_americas.head) and help(data_americas.tail) to find out what DataFrame.head and DataFrame.tail do.

- What method call will display the first three rows of this data?
- What method call will display the last three columns of this data? (Hint: you may need to change your view of the data.)


In [13]:
# Your code goes here

In [14]:
# Your code goes here

### 4) Data analysis using Pandas

What makes Pandas so attractive is its powerful interface for accessing individual records of a table, its proper handling of missing values, and its relational-database-style operations between DataFrames. To access a value at the position *[ i, j ]* of a DataFrame, we have two options, depending on what *i* means. Remember that a DataFrame provides an index as a way to identify the rows of the table; **a row**, then, has a position inside the table as well as a **label**, which uniquely identifies its entry in the DataFrame. We can use **DataFrame.iloc[..., ...]** to select values by their (entry) position, that is, by numerical index.

In [15]:
# Your code goes here

In [16]:
# Your code goes here

Alternatively, we can use **DataFrame.loc[..., ...]** to select values by their (entry) label (by row and column name).

In [17]:
# Your code goes here

We can use **:** on its own to mean all columns or all rows:

In [18]:
# Your code goes here

In [19]:
# Your code goes here

We can select multiple columns or rows using **DataFrame.loc** and a named slice:

In [20]:
# Your code goes here

Note that slicing using loc is inclusive at both ends, which differs from slicing using iloc, where slicing indicates everything up to but not including the final index.
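The difference in slice endpoints can be seen in a small sketch. The table below is invented for illustration; its labels and values are placeholders.

```python
import pandas as pd

df = pd.DataFrame(
    {"y2002": [1.0, 2.0, 3.0], "y2007": [4.0, 5.0, 6.0], "y2012": [7.0, 8.0, 9.0]},
    index=["A", "B", "C"],
)

# Position-based: the end of each slice is excluded
by_position = df.iloc[0:2, 0:2]

# Label-based: both ends of each slice are included
by_label = df.loc["A":"B", "y2002":"y2007"]

# Both selections cover rows A-B and columns y2002-y2007
print(by_position.equals(by_label))
```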

The result of slicing can be used in further operations, and all the statistical operators that work on entire dataframes work the same way on slices:

In [21]:
# Your code goes here

In [22]:
# Your code goes here

We can use comparisons to select data based on value. The comparison is applied element by element, and returns a similarly-shaped dataframe of **True** and **False**.


In [23]:
# Your code goes here

A frame full of Booleans is sometimes called a **mask** because of how it can be used. 

In [24]:
# Your code goes here

We get the value where the mask is **True**, and **NaN (Not a Number)** where it is **False**. This is useful because NaNs are ignored by operations like max, min, mean, etc.
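A minimal sketch of masking, using a made-up 2x2 table and an arbitrary threshold of 3:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 5.0], "b": [7.0, 2.0]}, index=["r1", "r2"])

mask = df > 3      # Boolean dataframe of the same shape
masked = df[mask]  # values where the mask is True, NaN elsewhere

print(masked)
print(masked.max())  # NaN entries are skipped
```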


In [25]:
# Your code goes here

In pandas, the primary method used to eliminate NaN (Not a Number) values from a DataFrame is ***dropna()***.

Let’s say we want to eliminate rows including NaN values in the previous example.


In [26]:
# Your code goes here

In pandas, the ***fillna()*** method replaces NaN values in a DataFrame with a specified value, such as the mean of a column:
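Both approaches to handling NaN values can be sketched on a small invented table (the values are placeholders). Neither call modifies the original dataframe; each returns a new one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

dropped = df.dropna()          # keep only rows that contain no NaN
filled = df.fillna(df.mean())  # replace each NaN with the mean of its column

print(dropped)
print(filled)
```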

In [27]:
# Your code goes here

In [28]:
# Your code goes here

### Exercise 11: 

Write an expression to find the Per Capita GDP of Serbia in 2007.

In [29]:
# Your code goes here

### 5) Data visualization using Matplotlib library

Visualization is a crucial skill for anyone working with data, as it enables us to transform complex datasets into clear and informative visual representations, such as charts, graphs, and plots. By learning visualization, you gain the ability to uncover patterns, trends, and insights within your data, making it easier to communicate findings and make data-driven decisions. 

**Matplotlib and Seaborn** are two Python libraries that offer a wide range of customization options, making them invaluable tools for creating visually appealing and meaningful visualizations that can enhance data understanding, aid in storytelling, and facilitate effective data-driven communication. Whether you are a data scientist, analyst, or anyone working with data, mastering these visualization libraries is a fundamental step in your journey toward becoming a proficient data professional. Here, we will focus only on Matplotlib. Let’s go through each visualization example step by step, starting with an explanation of what we are going to do, followed by the example code, and then a detailed explanation of the code.

**Line plot**

Line plots are used to visualize trends and changes in data over a continuous range, or time. We use line plots when we have data that can be represented as a series of points connected by lines, such as time series data or data with a natural ordering.

We will create a simple line plot using Matplotlib to visualize a set of data points.


In [30]:
# Your code goes here

#### Code explanation:

***import matplotlib.pyplot as plt***: This line imports the Matplotlib library, specifically the pyplot module, and aliases it as plt, which is a common convention.

***x and y*** represent the sample data points that we want to plot.

***plt.plot(x, y)***: This line creates a basic line plot using the plot function. It takes x and y as arguments to plot the data points.

***plt.xlabel('X-axis') and plt.ylabel('Y-axis')***: These lines label the X and Y axes, respectively, providing context for the plot.

***plt.title('Line Plot Example')***: This line adds a title to the plot.

***plt.show()***: This function displays the plot on the screen.
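Putting the calls explained above together gives a sketch like the following. The values in `x` and `y` are made-up sample points.

```python
import matplotlib.pyplot as plt

# Made-up sample data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

lines = plt.plot(x, y)  # returns a list of the line objects drawn
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Line Plot Example")
plt.show()
```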

**Scatter plots**

Scatter plots are used to visualize individual data points as dots on a two-dimensional plane. They are valuable for identifying patterns, trends, and relationships between two variables. 

We will create a scatter plot using Matplotlib to visualize individual data points.


In [31]:
# Your code goes here

***plt.scatter(x, y, color='red', marker='o', label='Data Points')***: This line creates a scatter plot. We specify the color, marker style (in this case, a red circle), and label for the data points.

Scatter plots excel at providing precise information about individual data points. They allow us to see the exact coordinates of each point, which can be critical in certain analyses, for example exploring the relationship between x and y values.
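A minimal sketch built around the `plt.scatter` call described above. The data points, axis labels, and title are assumptions chosen for illustration.

```python
import matplotlib.pyplot as plt

# Made-up sample data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

points = plt.scatter(x, y, color="red", marker="o", label="Data Points")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Scatter Plot Example")
plt.legend()  # shows the label given to scatter()
plt.show()
```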

Next, we will look at bar charts. 

**Bar charts**

Bar charts represent categorical data with discrete bars, making it easy to compare values across different categories. We use bar charts when we want to compare data across categories, show rankings, or display frequencies or counts for discrete items.

We will create a bar chart using Matplotlib to visualize categorical data and compare values.


In [32]:
# Your code goes here

#### Code Explanation

We import ***matplotlib.pyplot as plt*** as before.

***categories and values*** represent the categorical data and their corresponding values.

***plt.bar(categories, values, color='skyblue')***: This line creates a bar chart. We specify the categories, values, and color for the bars.

***plt.xlabel('Categories') and plt.ylabel('Values')***: These lines label the X and Y axes, providing context for the plot.

***plt.title('Bar Chart Example')***: This line adds a title to the plot.

***plt.show()***: This function displays the plot.

The bar chart allows for a straightforward comparison of values between different categories. This provides a clear and intuitive representation of the data.
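The steps explained for the bar chart can be sketched as follows. The category names and values are invented placeholders.

```python
import matplotlib.pyplot as plt

# Made-up categorical data
categories = ["A", "B", "C", "D"]
values = [10, 25, 15, 30]

bars = plt.bar(categories, values, color="skyblue")
plt.xlabel("Categories")
plt.ylabel("Values")
plt.title("Bar Chart Example")
plt.show()
```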

Next, we will look at histograms. 

**Histograms**

Histograms are used to visualize the distribution of a variable. They divide the data into bins or intervals and show the frequency or count of data points in each bin. We use histograms when we want to understand the shape of a dataset’s distribution, identify central tendencies (mean, median, mode), and observe data skewness or the presence of multiple peaks.

We will create a histogram using Matplotlib to visualize the distribution of numerical data.



In [33]:
# Your code goes here

#### Code Explanation

We import ***matplotlib.pyplot as plt*** as before and also import ***numpy as np*** for random data generation.

***data*** is generated as random values using NumPy’s ***randn()*** function, which samples from the standard normal distribution.

***plt.hist(data, bins=20, color='green', alpha=0.6)***: This line creates a histogram. We specify the data, the number of bins, the color of the bars, and their transparency (alpha).

***plt.xlabel('Values') and plt.ylabel('Frequency')***: These lines label the X and Y axes, providing context for the plot.

***plt.title('Histogram Example')***: This line adds a title to the plot.

***plt.show()***: This function displays the plot.


The histogram provides insights into the shape of the data distribution. In this case, it appears to be approximately normally distributed, centered around zero. Each bin on the x-axis represents a range of values, and the height of the bars indicates how many data points fall within that range. For example, the tallest bar around zero indicates a higher frequency of values in that range. The spread of the bars shows the variability of the data. Wider distributions indicate more variability, while narrower distributions suggest less variation.

Also remember that unusually high bars, or isolated bars far from the main cluster, may indicate outliers or anomalies in the data. The center of the distribution (often represented by the highest point) gives an indication of the central tendency of the data.
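A sketch of the histogram described in this section. The sample size of 1000 and the fixed random seed are assumptions added so that the figure is reproducible.

```python
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(0)             # assumption: fixed seed for reproducibility
data = np.random.randn(1000)  # samples from the standard normal distribution

# hist() returns the bin counts, the bin edges, and the drawn bars
counts, bin_edges, patches = plt.hist(data, bins=20, color="green", alpha=0.6)
plt.xlabel("Values")
plt.ylabel("Frequency")
plt.title("Histogram Example")
plt.show()
```

The returned `counts` and `bin_edges` can be inspected directly when the exact frequencies matter more than the picture.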


**Subplots**

Subplots in data visualization refer to the division of a single figure into several smaller plots, allowing various visualizations or related data representations to be displayed inside a shared space at the same time. They allow for side-by-side or grid-based chart configurations, aiding comparisons and the identification of correlations. Subplots are extremely useful for highlighting distinct parts of data or different data dimensions, boosting the viewer's understanding by offering several views within a consistent visual framework. They allow analysts and researchers to express complicated relationships, trends, or comparisons easily, improving the clarity and depth of data findings.

Let's walk through creating a figure with multiple subplots, each showcasing a different type of plot explained above: 
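One possible layout is a 2x2 grid built with `plt.subplots`, sketched below. The grid size, figure size, titles, and all data are assumptions chosen for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up sample data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
categories = ["A", "B", "C"]
values = [10, 25, 15]
data = np.random.randn(500)

# One figure holding a 2x2 grid of axes
fig, axes = plt.subplots(2, 2, figsize=(10, 8))

axes[0, 0].plot(x, y)
axes[0, 0].set_title("Line plot")

axes[0, 1].scatter(x, y, color="red")
axes[0, 1].set_title("Scatter plot")

axes[1, 0].bar(categories, values, color="skyblue")
axes[1, 0].set_title("Bar chart")

axes[1, 1].hist(data, bins=20, color="green", alpha=0.6)
axes[1, 1].set_title("Histogram")

fig.tight_layout()  # reduce overlap between titles and axes
plt.show()
```

Each entry of `axes` is an individual plot, so methods such as `set_title` are called on the axes object rather than on `plt`.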


In [34]:
# Your code goes here


**3-D plots (time-domain visualizations)**

Three-dimensional (3D) plots are important across different engineering disciplines because they allow engineers to visualize complex, multi-variable relationships that are impossible to grasp in two dimensions. In this example, we use a small dataset of simulated water measurements over time and space to visualize a wave surface in 3D. The data (wave.csv) is the result of simulation of the wave height (Z) at different positions (X) and times (T).

In [35]:
# Your code goes here


The code is set up to show the time history of the wave height at specific, fixed locations. This is exactly what you would get if you placed three stationary wave gauges (like buoys) at X=0, X=1, and X=2 and recorded the water surface elevation over time: the temporal profile of the wave passing each sensor. The lines show a travelling wave, because the wave peak is reached slightly later as you move to higher X values. The wave starts at X=0, then moves to X=1, and finally to X=2, so the peak height occurs at T=1 initially and then at T=2 and T=3.

### Supplementary section: NumPy

NumPy, which stands for Numerical Python, is a powerful library in Python that helps with numerical and scientific computing. It allows you to work with large, multi-dimensional arrays (think of them as grids of numbers) and provides many mathematical functions to perform calculations on these arrays quickly and easily.

### Why Use NumPy?
- **Speed:** NumPy is usually much faster than regular Python lists for numerical calculations. Its core is written in C, so operations on large sets of data run quickly.

- **Convenience:** With NumPy, you can do calculations on entire arrays without writing long loops. For example, adding two arrays together can be done in just one line of code.

- **Useful Functions:** NumPy has many built-in functions for doing complex math. This is really helpful in fields like oceanography, where scientists need to work with large amounts of data and perform a lot of calculations.

### Arrays
An **array** is a list of items stored one after another in memory. In NumPy, an array is a collection of values arranged in a grid, where all values are the same type (like all numbers or all text).

To use NumPy, we first need to import it:

In [36]:
# Your code goes here

### Creating a 1D Array
You can create a 1D array using the **np.array()** function. Here’s how:

In [37]:
# Your code goes here

### Advantages of NumPy Arrays Over Python Lists
- **Homogeneous:** All elements in a NumPy array must be of the same type, whereas Python lists can contain mixed types.
- **Performance:** NumPy arrays are more memory efficient and faster for numerical computations compared to lists.
- **Functionality:** NumPy provides a range of built-in functions that are specifically optimized for working with arrays.

### Creating a 2D Array
To create a 2D array, pass a list of lists to **np.array()**:

In [38]:
# Your code goes here

### Creating Arrays with Built-in Functions

NumPy provides several built-in functions to create arrays quickly:

- **np.zeros(shape)**: Creates an array filled with zeros.
- **np.ones(shape)**: Creates an array filled with ones.
- **np.arange(start, stop, step)**: Creates an array with a range of values.
- **np.linspace(start, stop, num)**: Creates an array of evenly spaced numbers.
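- **np.random.randn():** Generates random numbers drawn from the standard normal distribution (mean 0, standard deviation 1). For uniform values between 0 and 1, use **np.random.rand()** instead.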

In [39]:
# Your code goes here

In this example:

- **np.zeros((2, 3))** creates a 2x3 array filled with zeros.
- **np.ones((2, 3))** creates a 2x3 array filled with ones.
- **np.arange(0, 10, 2)** generates a 1D array with values from 0 up to, but not including, 10, stepping by 2 (0, 2, 4, 6, 8).
- **np.linspace(0, 1, 5)** creates an array of 5 evenly spaced numbers from 0 to 1, with both endpoints included.
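The calls listed above can be sketched together as follows. The variable names are arbitrary, and the `randn` call with 3 samples is an added illustration.

```python
import numpy as np

zeros = np.zeros((2, 3))       # 2x3 array of 0.0
ones = np.ones((2, 3))         # 2x3 array of 1.0
evens = np.arange(0, 10, 2)    # stop value 10 is excluded
spaced = np.linspace(0, 1, 5)  # both endpoints are included
normal = np.random.randn(3)    # 3 samples from the standard normal distribution

print(evens)   # [0 2 4 6 8]
print(spaced)  # [0.   0.25 0.5  0.75 1.  ]
```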


### Array Properties
You can check important properties of a NumPy array, such as its shape, size, and data type:

In [40]:
# Your code goes here

**This code retrieves:**

- The shape of the 2D array, which tells you the dimensions (rows and columns).
- The size of the array, indicating the total number of elements.
- The data type of the elements in the array, which shows the kind of data stored (e.g., integers, floats).
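A short sketch of these three properties on a made-up 2x3 array. The exact integer type reported by `dtype` depends on the platform, so it is printed without an expected value.

```python
import numpy as np

arr_2d = np.array([[1, 2, 3], [4, 5, 6]])

print(arr_2d.shape)  # (2, 3): 2 rows, 3 columns
print(arr_2d.size)   # 6 elements in total
print(arr_2d.dtype)  # an integer type; the width is platform-dependent
```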

### Basic Arithmetic Operations

NumPy allows you to perform element-wise arithmetic operations on arrays:

In [41]:
# Your code goes here

In this example, we create *two 1D arrays*, **arr_a** and **arr_b**, and add them together. 

Also, we perform element-wise multiplication. Each element of **arr_a** is multiplied by the corresponding element of **arr_b**, resulting in a new array of products.
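A minimal sketch of these two operations, assuming small made-up values for **arr_a** and **arr_b**:

```python
import numpy as np

arr_a = np.array([1, 2, 3])
arr_b = np.array([4, 5, 6])

total = arr_a + arr_b    # element-wise sum
product = arr_a * arr_b  # element-wise product (not a dot product)

print(total)    # [5 7 9]
print(product)  # [ 4 10 18]
```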

### NumPy Universal Functions

NumPy provides a variety of functions that operate on whole arrays. Universal functions (ufuncs), such as arithmetic and trigonometric functions, work element by element, while statistical functions reduce an array to summary values. Statistical functions are particularly useful for analyzing data and obtaining insights from numerical arrays.

- **Mean:** Calculates the average value of the array elements.
- **Median:** Finds the middle value when the elements are sorted.
- **Standard Deviation:** Measures the amount of variation or dispersion in the dataset.

Here’s how to use these functions:

In [42]:
# Your code goes here

We can also find the minimum and maximum values in an array:

In [43]:
# Your code goes here

Also, we can compute the correlation coefficient between two datasets:

In [44]:
# Your code goes here

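The correlation coefficient matrix shows the relationship between **arr_x** and **arr_y**. The off-diagonal entries hold the correlation between the two arrays, and the diagonal entries are always 1. A value close to -1 indicates a strong negative correlation, a value close to 1 indicates a strong positive correlation, and a value near 0 indicates little or no linear relationship.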
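A sketch using `np.corrcoef` on made-up data: **arr_y** is chosen as exactly twice **arr_x**, and the extra array `arr_z` (a hypothetical name) decreases as **arr_x** increases.

```python
import numpy as np

arr_x = np.array([1, 2, 3, 4, 5])
arr_y = np.array([2, 4, 6, 8, 10])  # rises with arr_x
arr_z = np.array([10, 8, 6, 4, 2])  # falls as arr_x rises

corr_xy = np.corrcoef(arr_x, arr_y)  # 2x2 matrix
corr_xz = np.corrcoef(arr_x, arr_z)

print(corr_xy[0, 1])  # close to 1
print(corr_xz[0, 1])  # close to -1
```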

### Other Operations
#### Indexing

You can access individual elements of an array using indexing:

In [45]:
# Your code goes here

This code retrieves the first element of the 1D array **arr_1d** and the element located at row 1, column 2 in the 2D array **arr_2d**.

#### Slicing
Slicing allows you to access a subset of an array:

In [46]:
# Your code goes here

This retrieves a slice of the 1D array, including elements from index 1 up to, but not including, index 4.

In [47]:
# Your code goes here

Here, we extract the first row of the 2D array and the second column using slicing. The : operator means "select all elements in this dimension."
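Indexing and slicing can be sketched together on small made-up arrays. Indices start at 0, so `arr_2d[1, 2]` refers to the second row and third column.

```python
import numpy as np

arr_1d = np.array([10, 20, 30, 40, 50])
arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

first = arr_1d[0]           # 10
element = arr_2d[1, 2]      # 6 (row index 1, column index 2)
middle = arr_1d[1:4]        # indices 1, 2, 3
first_row = arr_2d[0, :]    # every column of row 0
second_col = arr_2d[:, 1]   # every row of column 1

print(middle)      # [20 30 40]
print(first_row)   # [1 2 3]
print(second_col)  # [2 5 8]
```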

#### Reshaping Arrays
You can change the shape of an array without changing its data using the **reshape()** method:

In [48]:
# Your code goes here

In this code, we reshape the 1D array arr_1d into a 5x1 array. The total number of elements remains the same, but the structure changes.
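A minimal sketch of this reshape, assuming **arr_1d** holds five made-up values:

```python
import numpy as np

arr_1d = np.array([1, 2, 3, 4, 5])

# 5 rows, 1 column; the element count must stay at 5
column = arr_1d.reshape(5, 1)

print(arr_1d.shape)  # (5,)
print(column.shape)  # (5, 1)
```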

#### Joining Arrays
NumPy allows you to **concatenate (join)** arrays using functions like **np.concatenate()**, **np.vstack()** (for vertical stacking), and **np.hstack()** (for horizontal stacking):

In [49]:
# Your code goes here

Here, we concatenate **arr_a** and **arr_b** into a single 1D array containing all their elements.

In [50]:
# Your code goes here

In this example, we create two 2D arrays and stack them vertically, resulting in a new array with four rows.

In [51]:
# Your code goes here

Similarly, we stack the two 2D arrays horizontally, which combines the columns.
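The three joining functions can be sketched as follows. The arrays are made up, and the names `top` and `bottom` are hypothetical.

```python
import numpy as np

arr_a = np.array([1, 2, 3])
arr_b = np.array([4, 5, 6])
joined = np.concatenate((arr_a, arr_b))  # one 1D array with six elements

top = np.array([[1, 2], [3, 4]])
bottom = np.array([[5, 6], [7, 8]])
stacked_v = np.vstack((top, bottom))  # rows are appended: shape (4, 2)
stacked_h = np.hstack((top, bottom))  # columns are appended: shape (2, 4)

print(joined)
print(stacked_v)
print(stacked_h)
```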

#### Splitting Arrays
The **np.split()** function splits an array into multiple sub-arrays along a specified axis. The function takes two main arguments: the array to split and either the number of equal sections to create or a list of indices at which to split.

In [52]:
# Your code goes here

We create a **1D array**, joined_array, containing ten elements, and the **np.split(joined_array, 2)** function is used to split the array into two equal parts, resulting in a list of sub-arrays. 

It's important to note that if the number of elements in the original array is not evenly divisible by the number of splits specified, NumPy will raise a *ValueError*.
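The sketch below shows both the even split and the error case on a made-up ten-element array. It also uses **np.array_split()**, a related NumPy function that is not covered above and that allows sub-arrays of unequal length.

```python
import numpy as np

joined_array = np.arange(10)  # ten elements: 0 through 9

halves = np.split(joined_array, 2)  # two sub-arrays of five elements each

# Ten elements cannot be divided into three equal parts
try:
    np.split(joined_array, 3)
    split_failed = False
except ValueError:
    split_failed = True

# array_split tolerates uneven division
thirds = np.array_split(joined_array, 3)

print(halves)
print(split_failed)
print([len(part) for part in thirds])
```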