# **Pandas**
Pandas is an open-source Python package built on top of NumPy and originally developed by Wes McKinney. It is one of the most widely used tools for data cleaning and analysis, and it provides fast, flexible, and expressive data structures.
The name Pandas is derived from “panel data”, an econometrics term for data sets that include observations over multiple time periods for the same individuals.

# Types of Pandas Data Structures
**Pandas has historically offered three types of data structures:**

1. **Series:** A Series is a one-dimensional, labeled, array-like structure with homogeneous data. Its values are mutable. Its size is traditionally described as immutable (cannot be changed), although in practice assigning to a new label enlarges a Series.

2. **DataFrame:** A DataFrame is a two-dimensional, labeled structure with heterogeneous data. Data is aligned in a tabular manner (rows and columns). The size and values of a DataFrame are mutable.

3. **Panel:** A Panel is a three-dimensional data structure with heterogeneous data. It is hard to represent graphically, but it can be pictured as a container of DataFrames. The size and values of a Panel are mutable. Note that Panel was deprecated in pandas 0.20 and removed in pandas 0.25, so current versions provide only Series and DataFrame.
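As a minimal sketch of the two structures that current pandas versions provide, the example below builds a Series and a DataFrame by hand. The fruit names, prices, and column names are made up for illustration.

```python
import pandas as pd

# A Series: one-dimensional, labeled, one data type for all values.
prices = pd.Series([10.5, 20.0, 7.25], index=["apple", "banana", "cherry"])

# A DataFrame: two-dimensional, each column may have its own data type.
fruits = pd.DataFrame(
    {
        "name": ["apple", "banana", "cherry"],
        "price": [10.5, 20.0, 7.25],
        "in_stock": [True, False, True],
    }
)

# Values are mutable: overwrite an existing entry by its label.
prices["banana"] = 18.0

# The size of a DataFrame is mutable: add a new column in place.
fruits["quantity"] = [3, 0, 12]

print(prices)
print(fruits)
```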

# How to read and write files?

pandas integrates with many file formats and data sources out of the box (CSV, Excel, SQL, JSON, Parquet, …). Data from each of these sources is imported by a function with the prefix read_\* (for example pd.read_csv). Similarly, the to_\* methods (for example df.to_csv) are used to store data.
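The read/write pairing can be sketched with a CSV round trip. To keep the example self-contained it writes to an in-memory string instead of a file on disk; the column names and values are invented for illustration.

```python
import io

import pandas as pd

df = pd.DataFrame({"name": ["asha", "ravi"], "score": [90, 75]})

# to_csv with no path returns the CSV content as a string.
csv_text = df.to_csv(index=False)

# read_csv accepts a file path or any file-like object.
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(restored)
```

With a real file, the same calls would take a path, for example df.to_csv("scores.csv", index=False) followed by pd.read_csv("scores.csv"), where the file name is hypothetical.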

# Pandas Common Functions and Methods

In the list below, pd is the pandas module (import pandas as pd) and df is a DataFrame.

1. Read a comma-separated values (CSV) file : pd.read_csv('file path')
2. Read an Excel file : pd.read_excel('file path')
3. View top rows : df.head()
4. View bottom rows : df.tail()
5. Information : df.info()
6. Summary statistics : df.describe()
7. Name of columns : df.columns
8. Shape : df.shape

# Important Terms :

**Python**

Python is a computer programming language often used to build websites and software, automate tasks, and conduct data analysis. Python is a general-purpose language, meaning it can be used to create a variety of different programs and isn't specialized for any specific problems.

**Google Colab**

Colaboratory, or “Colab” for short, is a product from Google Research. Colab allows anybody to write and execute arbitrary python code through the browser, and is especially well suited to machine learning, data analysis and education.

**Pandas**

Pandas is a Python library used for working with data sets.
It has functions for analyzing, cleaning, exploring, and manipulating data.
The name "Pandas" has a reference to both "Panel Data", and "Python Data Analysis" and was created by Wes McKinney in 2008.

**Data Frame**

Pandas DataFrame is two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). A Data frame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.

**Data Analysis**

Obtaining an understanding of data by considering samples, measurement, and visualization. Data analysis can be particularly useful when a dataset is first received, before one builds the first model. It is also crucial in understanding experiments and debugging problems with the system.

# Pandas Series

Pandas Series is a one-dimensional labeled array capable of holding any data type. Pandas Series is built on top of NumPy array objects.

# How Pandas Series is different from 1-D Numpy Array


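1. A Pandas Series can hold a variety of data types (numbers, strings, dates, arbitrary Python objects), whereas a NumPy array is most often used for homogeneous numerical data, even though NumPy itself also supports other data types.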
2. Pandas Series supports index labels.
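The index-label difference can be illustrated with a short sketch. The names and marks are hypothetical.

```python
import numpy as np
import pandas as pd

arr = np.array([90, 75, 60])
marks = pd.Series([90, 75, 60], index=["asha", "ravi", "meena"])

# NumPy array: elements are reached by integer position only.
first_by_position = arr[0]

# Series: elements can be reached by label, and by position through .iloc.
ravi_marks = marks["ravi"]
last_by_position = marks.iloc[-1]

# to_numpy() returns the values of the Series as a NumPy array.
values = marks.to_numpy()
print(type(values), values)
```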

# Pandas DataFrame

Pandas Dataframe is a two dimensional labeled data structure. It consists of rows and columns.
Each column in Pandas DataFrame is a Pandas Series.
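A quick way to check that a column is a Series is to select one and inspect its type. This is a small sketch with made-up column names.

```python
import pandas as pd

df = pd.DataFrame({"height_cm": [150, 162, 171], "name": ["a", "b", "c"]})

# Selecting a single column returns a Series that shares the row index.
col = df["height_cm"]

print(type(col))
print(col.mean())
```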

# df.head()


Returns first 5 rows of dataframe (by default).

# df.tail()

Returns the last 5 rows of the dataframe(by default).
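A short sketch of head and tail on a ten-row DataFrame (the data is arbitrary). Both methods accept an optional row count.

```python
import pandas as pd

df = pd.DataFrame({"n": range(1, 11)})  # ten rows holding 1..10

first_five = df.head()   # default: first 5 rows
last_three = df.tail(3)  # explicit count: last 3 rows

print(first_five)
print(last_three)
```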

# df.info()

It prints a concise summary of the DataFrame, including the column names, their data types, the non-null counts, and the memory usage.

# df.dtypes

Returns a series with the datatypes of each column in the dataframe.

# df.shape


Return the number of rows and columns of the dataframe.

# df.values

Returns the NumPy representation of the DataFrame.
df.to_numpy() also returns the NumPy representation of the DataFrame, and the pandas documentation recommends it over df.values.

# df.columns


Return the column labels of the dataframe
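The inspection tools described so far can be tried together on a small DataFrame. This is a sketch with invented data; note that info() is a method, while dtypes, shape, and columns are attributes and take no parentheses.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["a", "b", "c"], "age": [21, 35, 48], "score": [88.5, 92.0, 79.5]}
)

df.info()                        # prints the summary and returns None
print(df.dtypes)                 # Series mapping each column to its data type
n_rows, n_cols = df.shape        # tuple of (rows, columns)
column_names = list(df.columns)  # column labels as a plain list
as_array = df.to_numpy()         # two-dimensional NumPy array of the values
```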

# df.describe()

Generates descriptive statistics. It describes the summary of all numerical columns in the dataframe.

# df.describe(include="all")

It describes the summary of all columns in the dataframe.
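The difference between the two calls can be seen on a DataFrame with one text column and one numeric column. The sketch below uses made-up sales data.

```python
import pandas as pd

df = pd.DataFrame({"city": ["X", "Y", "X", "Z"], "sales": [10, 20, 30, 40]})

# Default: only numeric columns (count, mean, std, min, quartiles, max).
numeric_summary = df.describe()

# include="all": every column; text columns get count, unique, top, freq.
full_summary = df.describe(include="all")

print(numeric_summary)
print(full_summary)
```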

# df.col_name.unique()


Returns the unique values in the column as a NumPy array.

# df.col_name.value_counts()

Return a Series containing counts of unique values.
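A small sketch of unique and value_counts on an invented column. It uses bracket selection, df["species"], which is equivalent to df.species when the column name is a valid Python identifier.

```python
import pandas as pd

df = pd.DataFrame({"species": ["cat", "dog", "cat", "bird", "cat", "dog"]})

# Distinct values, in order of first appearance.
unique_species = df["species"].unique()

# Number of rows per distinct value, most frequent first.
counts = df["species"].value_counts()

print(unique_species)
print(counts)
```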

# df.col_name.astype()

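Converts a particular column to the given data type, for example df.col_name.astype(int). It returns a new Series, so the result has to be assigned back to the column to keep it.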
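As an illustration, the sketch below converts a column of numeric strings to integers and a float column to a smaller float type. The column names and values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"age": ["21", "35", "48"], "score": [88.5, 92.0, 79.5]})

# astype returns a new Series; assign it back to replace the column.
df["age"] = df["age"].astype(int)
df["score"] = df["score"].astype("float32")

print(df.dtypes)
```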

# df.sort_values(by="Col_1")

It will sort the DataFrame by the column "Col_1".
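A brief sketch of sorting in both directions. The second column is added only to show that whole rows move together; sort_values returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

df = pd.DataFrame({"Col_1": [3, 1, 2], "Col_2": ["c", "a", "b"]})

ascending = df.sort_values(by="Col_1")                    # smallest first
descending = df.sort_values(by="Col_1", ascending=False)  # largest first

print(ascending)
print(descending)
```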

# train_test_split

The train_test_split function of the sklearn.model_selection module in Python splits arrays or matrices into random train and test subsets.
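A minimal sketch of a split, assuming scikit-learn is installed. The arrays are synthetic, and the 80/20 ratio and the seed are arbitrary choices for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.array([0, 1] * 5)          # 10 labels

# test_size=0.2 holds out 20% of the rows; random_state makes the
# shuffle reproducible from run to run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```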

# Machine Learning
Machine learning (ML) is a branch of artificial intelligence (AI) that enables computers to “self-learn” from training data and improve over time, without being explicitly programmed. Machine learning algorithms are able to detect patterns in data and learn from them, in order to make their own predictions. In short, machine learning algorithms and models learn through experience.

In traditional programming, a computer engineer writes a series of directions that instruct a computer how to transform input data into a desired output. Instructions are mostly based on an IF-THEN structure: when certain conditions are met, the program executes a specific action.

Machine learning, on the other hand, is an automated process that enables machines to solve problems with little or no human input, and take actions based on past observations.

# Types of Machine Learning
**1. Supervised Learning** 

Supervised learning models make predictions based on labeled (y) training data. Each training sample includes an input (X) and a desired output(y). A supervised learning algorithm analyzes this sample data and makes an inference or estimate.

This is the most common and popular approach to machine learning. It’s “supervised” because these models need to be fed manually tagged sample data to learn from. Data is labeled to tell the machine what patterns (similar words and images, data categories, etc.) it should be looking for and recognize connections with.

For example, if you want to automatically detect spam, you would need to feed a machine learning algorithm examples of emails that you want classified as spam and others that are important, and should not be considered spam.
Which brings us to our next point – the two types of supervised learning tasks: classification and regression.


**a. Classification**

In a classification problem, the output value is a category with two or more options. Examples include predicting gender, the outcome of a match (win or lose), fraud or no fraud, or the species of a fish.

**b. Regression**

In a regression problem, the expected outcome is a continuous number. Regression models are used to predict quantities such as the probability that an event will happen, a company's sales, a country's GDP, prices, weight, or height, meaning the output may take any numeric value within a certain range.

**2. Unsupervised Learning**

Unsupervised learning algorithms uncover insights and relationships in unlabeled data. In this case, models are fed input data but the desired outcomes are unknown, so they have to make inferences based on circumstantial evidence, without any guidance or training. The models are not trained with the “right answer,” so they must find patterns on their own.

One of the most common types of unsupervised learning is clustering, which consists of grouping similar data. This method is mostly used for exploratory analysis and can help you detect hidden patterns or trends. For example, the marketing team of an e-commerce company could use clustering to improve customer segmentation. Given a set of income and spending data, a machine learning model can identify groups of customers with similar behaviors.

Segmentation allows marketers to tailor strategies for each key market. They might offer promotions and discounts for low-income customers that are high spenders on the site, as a way to reward loyalty and improve retention.
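The customer segmentation example can be sketched with k-means clustering from scikit-learn. Everything concrete here is an assumption made for illustration: the nine customers, their income and spending figures, and the choice of three clusters.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers: [annual income, annual spend on the site],
# both in thousands of the same currency.
customers = np.array([
    [20, 18], [22, 20], [25, 19],  # low income, high spend for that income
    [80, 15], [85, 12], [90, 18],  # high income, low spend
    [82, 70], [88, 75], [95, 72],  # high income, high spend
])

# No labels are passed in: the model groups customers by similarity alone.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(customers)

print(labels)
print(model.cluster_centers_)
```

The cluster numbers themselves carry no meaning; a person still has to look at each group and decide what it represents.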


**3. Semi-Supervised Learning**

In semi-supervised learning, the training data is split into two parts: a small amount of labeled data and a larger set of unlabeled data. The model uses the labeled data to make inferences about the unlabeled data, which can give more accurate results than a supervised model trained on the small labeled set alone.

This approach is gaining popularity, especially for tasks involving large datasets such as image classification. Semi-supervised learning doesn’t require a large amount of labeled data, so it can be faster to set up and more cost-effective than fully supervised methods, which suits businesses that receive huge amounts of data.

**4. Reinforcement Learning**

Reinforcement learning (RL) is concerned with how a software agent (or computer program) ought to act in a situation to maximize the reward. In short, reinforced machine learning models attempt to determine the best possible path they should take in a given situation. They do this through trial and error. Since there is no training data, machines learn from their own mistakes and choose the actions that lead to the best solution or maximum reward.

This machine learning method is mostly used in robotics and gaming. Video games demonstrate a clear relationship between actions and results, and can measure success by keeping score. Therefore, they’re a great way to improve reinforcement learning algorithms.

**Deep Learning (DL)**

Deep learning models can be supervised, semi-supervised, or unsupervised (or a combination of any or all of the three). They’re advanced machine learning algorithms.

Deep learning is based on Artificial Neural Networks (ANN), a type of computer system that emulates the way the human brain works. Deep learning algorithms or neural networks are built with multiple layers of interconnected neurons, allowing multiple systems to work together simultaneously, and step-by-step.

When a model receives input data ‒ which could be image, text, video, or audio ‒ and is asked to perform a task (for example, text classification with machine learning), the data passes through every layer, enabling the model to learn progressively. It’s kind of like a human brain that evolves with age and experience!

Deep learning is common in image recognition, speech recognition, and Natural Language Processing (NLP). Deep learning models usually perform better than other machine learning algorithms on complex problems and massive data sets. However, they generally require large training datasets and take a long time to train.




# How Machine Learning Works

At its most simplistic, the machine learning process involves three steps:
1. Feed a machine learning model training input data. In our case, this could be customer comments from social media or customer service data.

2. Tag training data with a desired output. In this case, tell your sentiment analysis model whether each comment or piece of data is Positive, Neutral, or Negative. The model transforms the training data into text vectors – numbers that represent data features.

3. Test your model by feeding it testing (or unseen) data. Algorithms are trained to associate feature vectors with tags based on manually tagged samples, then learn to make predictions when processing unseen data.
If your new model performs to your standards and criteria after testing it, it’s ready to be put to work on all kinds of new data. If it’s not performing accurately, you’ll need to keep training. Furthermore, as human language and industry-specific language morphs and changes, you may need to continually train your model with new information.
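The three steps above can be sketched with scikit-learn. This is one possible setup, not the only one: the comments and tags are invented, word counts (CountVectorizer) stand in for the text vectors, and logistic regression is an assumed choice of algorithm.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Steps 1 and 2: training inputs and their tags (hypothetical comments).
train_texts = [
    "I love this product, it works great",
    "Great support, very happy with the service",
    "The parcel arrived on Tuesday",
    "I ordered the blue model",
    "Terrible experience, the app keeps crashing",
    "Very unhappy, the delivery was late and the box was damaged",
]
train_tags = ["Positive", "Positive", "Neutral", "Neutral", "Negative", "Negative"]

# CountVectorizer turns each text into a vector of word counts;
# LogisticRegression learns to associate those vectors with the tags.
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_tags)

# Step 3: test the model on comments it has not seen.
test_texts = ["I am very happy, great product", "The app is terrible"]
predicted_tags = model.predict(test_texts)

print(list(zip(test_texts, predicted_tags)))
```

Six training comments are far too few for a usable model; the sketch only shows the shape of the workflow.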

# Machine Learning Use Cases

Machine learning applications and use cases are nearly endless, especially as we begin to work from home more (or have hybrid offices), become more tied to our smartphones, and use machine learning-guided technology to get around.
Machine learning is already in regular use in finance, healthcare, hospitality, government, and beyond. Businesses are beginning to see the benefits of using machine learning tools to improve their processes, gain valuable insights from unstructured data, and automate tasks that would otherwise require hours of tedious, manual work (which usually produces much less accurate results).

For example, UberEats uses machine learning to estimate optimum times for drivers to pick up food orders, while Spotify leverages machine learning to offer personalized content and personalized marketing. Dell uses machine learning text analysis to save hundreds of hours analyzing thousands of employee surveys, so that it can listen to the voice of the employee (VoE) and improve employee satisfaction.

How do you think Google Maps predicts peaks in traffic, or Netflix creates personalized movie recommendations and even informs the creation of new content? By using machine learning, of course.

There are many different applications of machine learning, which can benefit your business in countless ways. You’ll just need to define a strategy to help you decide the best way to implement machine learning into your existing processes. In the meantime, here are some common machine learning use cases and applications that might spark some ideas:

1. Social Media Monitoring
2. Customer Service & Customer Satisfaction
3. Image Recognition
4. Virtual Assistants
5. Product Recommendations
6. Stock Market Trading
7. Medical Diagnosis

# Some Open Source Libraries for Machine Learning

Open source machine learning libraries offer collections of pre-made models and components that developers can use to build their own applications, instead of having to code from scratch. They are free, flexible, and can be customized to meet specific needs.

Some of the most popular open-source libraries for machine learning include:

1. Scikit-learn
2. PyTorch
3. NLTK
4. TensorFlow
5. Keras

**Scikit-learn**

Scikit-learn is a popular Python library and a great option for those who are just starting out with machine learning. Why? It’s easy to use, robust, and very well documented. You can use this library for tasks such as classification, clustering, and regression, among others.

**PyTorch**

Developed by Facebook, PyTorch is an open source machine learning library based on the Torch library with a focus on deep learning. It’s used for computer vision and natural language processing, and is much better at debugging than some of its competitors.

**NLTK**

The Natural Language Toolkit (NLTK) is possibly the best known Python library for working with natural language processing. It can be used for keyword search, tokenization and classification, voice recognition and more.

**TensorFlow**

TensorFlow is an open-source library developed by Google for internal use and then released under an open license, with plenty of resources, tutorials, and tools to help you hone your machine learning skills. Suitable for both beginners and experts, it has what you need to build and train machine learning models (including a library of pre-trained models). TensorFlow focuses on deep learning and is designed to scale, which makes it a good fit for complex projects with large-scale data. However, it may take time and skill to master.

**Keras**

Keras is a high-level deep learning API developed at Google for implementing neural networks. It is written in Python and is designed to make the implementation of neural networks easy. It also supports multiple backends for neural network computation.

Keras is relatively easy to learn and work with because it provides a Python frontend with a high level of abstraction while offering a choice of backends for computation. This abstraction can make Keras slower than lower-level deep learning frameworks, but it also makes Keras very beginner-friendly.

# Important Terms :
**Artificial Intelligence**

A non-human program or model that can solve sophisticated tasks. For example, a program or model that translates text or a program or model that identifies diseases from radiologic images both exhibit artificial intelligence.

**Binary Classification**


A type of classification task that outputs one of two mutually exclusive classes. For example, a machine learning model that evaluates email messages and outputs either "spam" or "not spam" is a binary classifier.

**Categorical Data**

Features having a discrete set of possible values. For example, consider a categorical feature named house type, which has a discrete set of three possible values:

1. row house
2. multi story
3. farm house

By representing house type as categorical data, the model can learn the separate impacts of row house, multi story, and farm house on house price.
Categorical features are sometimes called discrete features.
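One common way to feed a categorical feature to a model is one-hot encoding, sketched below with pandas. The prices are invented, and one-hot encoding is only one of several possible encodings.

```python
import pandas as pd

houses = pd.DataFrame(
    {
        "house_type": ["row house", "multi story", "farm house", "row house"],
        "price": [100, 150, 200, 110],  # hypothetical prices
    }
)

# Each category becomes its own indicator column, so a model can learn
# a separate effect for each house type.
encoded = pd.get_dummies(houses, columns=["house_type"])

print(encoded)
```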

**Class**


One of a set of enumerated target values for a label. For example, in a binary classification model that detects spam, the two classes are spam and not spam. In a multi-class classification model that identifies dog breeds, the classes would be poodle, beagle, pug, and so on.

**Classification Model**


A type of model that distinguishes among two or more discrete classes. For example, a natural language processing classification model could determine whether an input sentence was in French, Spanish, or Italian.

**Clustering**


Grouping related examples, particularly during unsupervised learning. Once all the examples are grouped, a human can optionally supply meaning to each cluster.
Many clustering algorithms exist. For example, the k-means algorithm clusters examples based on their proximity to a centroid.

**Continuous Feature**


A floating-point feature with an infinite range of possible values.

**Data Analysis**

Obtaining an understanding of data by considering samples, measurement, and visualization. Data analysis can be particularly useful when a dataset is first received, before one builds the first model. It is also crucial in understanding experiments and debugging problems with the system.

**Deep Model**


A type of neural network containing multiple hidden layers.


**Dimension Reduction**

Decreasing the number of dimensions used to represent a particular feature in a feature vector, typically by converting to an embedding.

**Discrete Feature**



A feature with a finite set of possible values. For example, a feature whose values may only be animal, vegetable, or mineral is a discrete (or categorical) feature.

**Ensemble**

A collection of models trained independently whose predictions are averaged or aggregated. In many cases, an ensemble produces better predictions than a single model. For example, a random forest is an ensemble built from multiple decision trees. Note that not all decision forests are ensembles.

**Feature**


An input variable used in making predictions.

**Label**


In supervised learning, the "answer" or "result" portion of an example. Each example in a labeled dataset consists of one or more features and a label. For instance, in a housing dataset, the features might include the number of bedrooms, the number of bathrooms, and the age of the house, while the label might be the house's price. In a spam detection dataset, the features might include the subject line, the sender, and the email message itself, while the label would probably be either "spam" or "not spam."

**Labeled Example**


An example that contains features and a label. In supervised training, models learn from labeled examples.

**Machine Learning**


A program or system that builds (trains) a predictive model from input data. The system uses the learned model to make useful predictions from new (never-before-seen) data drawn from the same distribution as the one used to train the model. Machine learning also refers to the field of study concerned with these programs or systems.

**Multi-class Classification**


Classification problems that distinguish among more than two classes. For example, there are approximately 128 species of maple trees, so a model that categorized maple tree species would be multi-class. Conversely, a model that divided emails into only two categories (spam and not spam) would be a binary classification model.

**Numerical Data**


Features represented as integers or real-valued numbers. For example, in a real estate model, you would probably represent the size of a house (in square feet or square meters) as numerical data. Representing a feature as numerical data indicates that the feature's values have a mathematical relationship to each other and possibly to the label. For example, representing the size of a house as numerical data indicates that a 200 square-meter house is twice as large as a 100 square-meter house. Furthermore, the number of square meters in a house probably has some mathematical relationship to the price of the house.
Not all integer data should be represented as numerical data. For example, postal codes in some parts of the world are integers; however, integer postal codes should not be represented as numerical data in models. That's because a postal code of 20000 is not twice (or half) as potent as a postal code of 10000. Furthermore, although different postal codes do correlate to different real estate values, we can't assume that real estate values at postal code 20000 are twice as valuable as real estate values at postal code 10000. Postal codes should be represented as categorical data instead.

Numerical features are sometimes called continuous features.


**Parameter**


A variable of a model that the machine learning system trains on its own. For example, weights are parameters whose values the machine learning system gradually learns through successive training iterations. Contrast with hyperparameter.

**Recommendation System**

A system that selects for each user a relatively small set of desirable items from a large corpus. For example, a video recommendation system might recommend two videos from a corpus of 100,000 videos, selecting Casablanca and The Philadelphia Story for one user, and Wonder Woman and Black Panther for another. A video recommendation system might base its recommendations on factors such as:
1. Movies that similar users have rated or watched.
2. Genre, directors, actors, target demographic

**Regression Model**


A type of model that outputs continuous (typically, floating-point) values. Compare with classification models, which output discrete values, such as "day lily" or "tiger lily."

**Reinforcement Learning (RL)**


A family of algorithms that learn an optimal policy, whose goal is to maximize return when interacting with an environment. For example, the ultimate reward of most games is victory. Reinforcement learning systems can become expert at playing complex games by evaluating sequences of previous game moves that ultimately led to wins and sequences that ultimately led to losses.

**Self-Supervised Learning**

A family of techniques for converting an unsupervised machine learning problem into a supervised machine learning problem by creating surrogate labels from unlabeled examples.

Some Transformer-based models such as BERT use self-supervised learning.
Self-supervised training is a semi-supervised learning approach.

**Semi-Supervised Learning**

Training a model on data where some of the training examples have labels but others don't. One technique for semi-supervised learning is to infer labels for the unlabeled examples, and then to train on the inferred labels to create a new model. Semi-supervised learning can be useful if labels are expensive to obtain but unlabeled examples are plentiful.

Self-training is one technique for semi-supervised learning.

**Sentiment Analysis**

Using statistical or machine learning algorithms to determine a group's overall attitude—positive or negative—toward a service, product, organization, or topic. For example, using natural language understanding, an algorithm could perform sentiment analysis on the textual feedback from a university course to determine the degree to which students generally liked or disliked the course.

**Supervised Machine Learning**

Training a model from input data and its corresponding labels. Supervised machine learning is analogous to a student learning a subject by studying a set of questions and their corresponding answers. After mastering the mapping between questions and answers, the student can then provide answers to new (never-before-seen) questions on the same topic. Compare with unsupervised machine learning.

**Time Series Analysis**

A subfield of machine learning and statistics that analyzes temporal data. Many types of machine learning problems require time series analysis, including classification, clustering, forecasting, and anomaly detection. For example, you could use time series analysis to forecast the future sales of winter coats by month based on historical sales data.

**Unlabeled Example**

An example that contains features but no label. Unlabeled examples are the input to inference. In semi-supervised and unsupervised learning, unlabeled examples are used during training.

**Unsupervised Machine Learning**


Training a model to find patterns in a dataset, typically an unlabeled dataset.
The most common use of unsupervised machine learning is to cluster data into groups of similar examples. For example, an unsupervised machine learning algorithm can cluster songs together based on various properties of the music. The resulting clusters can become an input to other machine learning algorithms (for example, to a music recommendation service). Clustering can be helpful in domains where true labels are hard to obtain. For example, in domains such as anti-abuse and fraud, clusters can help humans better understand the data.
Another example of unsupervised machine learning is principal component analysis (PCA). For example, applying PCA on a dataset containing the contents of millions of shopping carts might reveal that shopping carts containing lemons frequently also contain antacids.

**Weight**

A coefficient for a feature in a linear model, or an edge in a deep network. The goal of training a linear model is to determine the ideal weight for each feature. If a weight is 0, then its corresponding feature does not contribute to the model.
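The idea of a zero weight can be checked with a small linear regression sketch. The data is synthetic: the target is built from the first feature only, so the second feature should receive a weight of approximately 0.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))  # two features

# The target depends on feature 0 only (weight 3) plus a constant 5.
y = 3.0 * X[:, 0] + 5.0

model = LinearRegression().fit(X, y)

print(model.coef_)       # learned weights, one per feature
print(model.intercept_)  # learned constant term
```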






# Classification Regression 06-09-2022

**Regression Problem:**

When numeric values (real, continuous, discrete, etc.) must be predicted, there is a need to quantify the impact of numerous variables (known as independent or X variables) on a numerical entity (known as the dependent or Y variable).

**Classification Problem:**

This business problem is somewhat similar to a regression problem. Here too, there is a requirement to quantify the impact of variables (known as independent or X variables) on a dependent variable. The difference is that here the Y variable is categorical. Thus, a model is developed from the X variables to assign observations to predefined classes.

**Forecasting Problem:**

When values are required to be predicted over a period of time and the time acts as a predictor, then these problems are known as forecasting problems.


**Segmentation Problem:**

There are situations where a bulk of data needs to be categorized, but there are no pre-existing classes that can be used to supervise the model. Here the underlying patterns are detected and used to divide the observations into different categories. These categories are then defined by understanding the characteristics of the observations found in each one.

While all the above-mentioned business problems can be found in the industry, the most commonly found business problem is classification. Often businesses require their output to be categorized into predefined classes, and this is where classification models come in handy. 



# Types of Classification

Different classification techniques can be distinguished by the nature of the dependent variable. Of the various classification techniques, the most common ones are the following:

**1. Binary Classification:**

The most basic and commonly used form of classification is a binary classification. Here, the dependent variable comprises two exclusive categories that are denoted through 1 and 0, hence the term Binary Classification. Often 1 means True and 0 means False. For example, if the business problem is whether the bank member was able to repay the loan and we have a feature/variable that says “Loan Defaulter,” then the response will either be 1 (which would mean True, i.e., Loan defaulter) or 0 (which would mean False, i.e., Non-Loan Defaulter). This classification has often formed the basis of various classification algorithms and is the kind of classification technique that is foremost understood.

**2. Binomial Classification:**

This classification type is technically the same as Binary Classification, because the Y variable comprises two categories. However, these categories may not be in the form of True and False. For example, if we have a dataset with multiple features that denote pixel density and a Y variable with two categories (“Car” or “Bike”), this type of classification is known as Binomial Classification. From a practical point of view, especially as far as Machine Learning is concerned, there is no difference, as these two categories can also be encoded as 0 and 1, making this type look just like Binary Classification.

**3. Multi-class Classification:**

A more advanced form of classification, multi-class classification, is when the Y variable comprises more than two classes/categories. Here each observation belongs to one class, and a classification algorithm has to establish the relationship between the input variables and the classes. Therefore, during prediction, each observation is assigned to a single exclusive class. For example, a business problem where there is a need to categorize whether an observation is “Safe,” “At-Risk,” or “Unsafe” would be a multi-class classification problem. Note: each observation can belong to only one class, and multiple classes can’t be assigned to an observation. Thus, an observation will be either “Safe” or “At-Risk” or “Unsafe” and can’t be multiple things.

**4. Multi-label Classification:**

This form of classification is similar to multi-class classification. Here, the dependent variable has more than two categories; however, it differs from multi-class classification because an observation can be mapped to more than one category. Therefore, the classification algorithm has to understand which classes an observation can be related to and learn the patterns accordingly. A common use case for this type of classification problem is found in text mining, where an observation (e.g., text from a newspaper article) can have multiple categories in its corresponding dependent variable (such as “Politics,” “Name of Politicians Involved,” “Important Geographical Location,” etc.).

Thus, classification can be of multiple types, and depending upon the business problem, we have to identify the kind of classification technique and the algorithms involved in it.
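For the multi-label case, the dependent variable is often encoded as one 0/1 column per category, so that a row can have several 1s. The sketch below uses scikit-learn's MultiLabelBinarizer; the article tags are hypothetical.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each article may carry any number of tags (hypothetical examples).
article_tags = [
    {"Politics", "Geography"},
    {"Sports"},
    {"Politics", "Sports"},
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(article_tags)

print(mlb.classes_)  # column order of the indicator matrix
print(Y)             # one row per article, one column per tag
```

In a multi-class problem each row of such a matrix would contain exactly one 1; here a row may contain several.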

# Logistic Regression Classifier

Logistic Regression is among the oldest ways of performing classification. This algorithm belongs to the family of Generalized Linear Models, where a logit link function relates the probability of the Y variable to a linear combination of the X variables, so a linear model can be fitted to produce probabilities. In terms of discipline, Logistic Regression is a classical statistical model and uses traditional statistical concepts to come up with probabilities.
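A minimal sketch of a logistic regression classifier with scikit-learn, assuming a made-up dataset of study hours and pass/fail outcomes. It also recomputes the predicted probabilities by hand, passing the fitted linear model through the sigmoid (the inverse of the logit), to show where the probabilities come from.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied (X) and whether the student passed (y).
hours = np.array([[0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [4.5]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(hours, passed)

# Probability of class 1 (pass) for two new students.
new_hours = np.array([[1.0], [4.0]])
proba = clf.predict_proba(new_hours)[:, 1]

# The same numbers by hand: sigmoid(intercept + weight * hours).
linear_part = clf.intercept_[0] + clf.coef_[0, 0] * new_hours[:, 0]
manual = 1.0 / (1.0 + np.exp(-linear_part))

print(proba)
print(manual)
```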

# Classification Use Cases

**Email Classification:**

Among the most traditional and early use of classification, emails were often required to be classified as spam or not spam (i.e., being worthy of being in the inbox). This is done by developing a classifier model that goes through the content of the data and identifies if the email is spam or not.

**Image Classification:**

1. Image classification is used extensively in the automobile industry, especially by companies trying to create self-driving cars.
2. Security Agencies are increasingly using image classification to identify culprits, while home security systems are heavily relying on Image classification to raise alarms in case of an intrusion.
3. Image classification’s lifesaving application has been in healthcare, where image classification has enabled early detection of diseases and has helped develop robots that use computer vision to perform complicated surgeries.

**Anomaly / Fraud Detection:**