# Basic Analysis 

In text analysis, we often create features from our text. A feature is simply a piece of measurable information. Before we do any cleaning or complex analysis, we can get a good overview of our dataset by extracting some simple high-level features. This helps us understand the "shape" of our text data.

#### In this notebook, you will learn how to:
- Create new columns 
- Calculate basic features like character count, word count and punctuation count
- Calculate a more advanced feature like average word length

## Setup: Loading the BBC dataset
This cell imports pandas and loads the BBC News dataset from a stable URL. The DataFrame is called bbc_df, and the main column we'll be working with is 'text'.

In [None]:
import pandas as pd

# This URL points to a stable dataset hosted on Google Cloud Storage.
url = 'https://storage.googleapis.com/dataset-uploader/bbc/bbc-text.csv'

# Load the data into a DataFrame
bbc_df = pd.read_csv(url)

print("BBC News dataset loaded successfully!")
print("Shape of the DataFrame:", bbc_df.shape)
bbc_df.head()

BBC News dataset loaded successfully!
Shape of the DataFrame: (2225, 2)


| | category | text |
|---|---|---|
| 0 | tech | tv future in the hands of viewers with home th... |
| 1 | business | worldcom boss left books alone former worldc... |
| 2 | sport | tigers wary of farrell gamble leicester say ... |
| 3 | sport | yeading face newcastle in fa cup premiership s... |
| 4 | entertainment | ocean s twelve raids box office ocean s twelve... |


### Basic Count Features 

Let's start by creating three columns in our DataFrame to store some basic counts for each article.

#### Character Count
This is the total number of characters in the article, including letters, spaces and punctuation. It gives us a rough idea of the article's length. We can get this easily using .str.len().

In [None]:
# Create a new column 'char_count'
bbc_df['char_count'] = bbc_df['text'].str.len()

# Look at the first few rows with our new column
bbc_df[['text', 'char_count']].head()
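
To see what `.str.len()` counts, here is a minimal sketch on a few made-up strings (the `toy` Series is hypothetical and not part of the BBC data). It suggests that spaces and punctuation are counted like any other character, and that a missing value stays missing rather than raising an error.

```python
import pandas as pd

# Hypothetical toy data, not taken from the BBC dataset
toy = pd.Series(["hello world", "hi!", None])

# .str.len() counts every character, including spaces and punctuation
lengths = toy.str.len()
print(lengths)
```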

#### Word Count
How many words are in each article? To get this, we can use a two-step process:
1. Use .str.split() to split each article string into a list of words
2. Use .apply(len) to run the len() function on each list to count the number of words

In [None]:
# Create a new column 'word_count'
bbc_df['word_count'] = bbc_df['text'].str.split().apply(len)

# Look at the first few rows with our new column
bbc_df[['text', 'word_count']].head()
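
The sketch below uses a plain Python string (the example sentence is made up) to illustrate why `.split()` is called without arguments: it splits on any run of whitespace, so repeated spaces do not produce empty "words". Splitting on a single space, by contrast, would.

```python
# Hypothetical example sentence with a double space in the middle
sentence = "tv future  in the hands of viewers"

# split() with no arguments treats any run of whitespace as one separator
words = sentence.split()
word_count = len(words)

# split(" ") keeps an empty string where the double space was
naive_words = sentence.split(" ")

print(word_count)        # number of real words
print(len(naive_words))  # one more, because of the empty string
```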

#### Punctuation Count
Let's see how "noisy" our text is by counting the number of punctuation marks. This will be very useful later to see if our cleaning process works. We'll use .str.count() with a regular expression that looks for common punctuation. _You don't need to understand what a regular expression is; just think of it as a template that we can use to find certain characters._

In [None]:
# Create a new column 'punc_count'
# The [.,!?:;-] part is a simple regular expression that counts any of those characters.
bbc_df['punc_count'] = bbc_df['text'].str.count(r'[.,!?:;-]')

# Look at the first few rows with our new column
bbc_df[['text', 'punc_count']].head()
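
As a rough per-string equivalent of what `.str.count()` does with this pattern, the following sketch uses Python's built-in `re` module on a made-up sentence. The square brackets define a set of characters, and each single character from the set counts as one match. The hyphen is placed last inside the brackets so that it is read as a literal hyphen rather than as a range.

```python
import re

# Same pattern as in the cell above
PUNCT_PATTERN = r"[.,!?:;-]"

# Hypothetical example sentence
sentence = "Well-known fact: cats sleep, a lot... really?"

# Each matching character is returned as its own list item
matches = re.findall(PUNCT_PATTERN, sentence)
punc_count = len(matches)

print(matches)
print(punc_count)
```

Characters outside the set, such as quotation marks or brackets, are not counted by this pattern.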

### A More Revealing Feature: Average Word Length
Since articles can be long, we can calculate more interesting features. The average word length can sometimes give us clues about the complexity or style of the text.

To calculate this, we can't just divide the character count by the word count, because the character count includes spaces!

Here is the correct way to do it:
1. Remove all the spaces from the text 
2. Count the characters in the text without spaces
3. Divide that by the total word count 

Let's write a small function to do this.

In [None]:
# This function calculates the average word length for a given text
def calculate_avg_word_length(text):
    words = text.split()
    total_chars = len("".join(words))
    total_words = len(words)
    # Avoid dividing by zero for any empty text
    if total_words == 0:
        return 0
    return total_chars / total_words

# Create a new column 'avg_word_length' by applying our function to the 'text' column
bbc_df['avg_word_length'] = bbc_df['text'].apply(calculate_avg_word_length)

# Look at the first few rows with our length-related feature columns
bbc_df[['text', 'char_count', 'word_count', 'avg_word_length']].head()
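
To make the difference concrete, here is a small worked example on a made-up three-word string. It compares the naive calculation (characters divided by words) with the space-free version used above; the variable names are illustrative only.

```python
# Hypothetical example text
text = "the cat sat"

words = text.split()

# Naive: total characters (spaces included) divided by the word count
naive_avg = len(text) / len(words)

# Space-free: only the characters that belong to words
correct_avg = len("".join(words)) / len(words)

print(round(naive_avg, 2))  # inflated by the two spaces
print(correct_avg)
```

Every word here has three letters, so the space-free average matches intuition, while the naive figure is pushed up by the spaces between words.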

## Exercises

Time to practice. Use the bbc_df DataFrame with its new columns to answer three questions.

#### Exercise 1: The Longest Article
Find the word count of the longest article in the entire dataset. (Hint: use the .max() method on the 'word_count' column).


In [None]:
# Your code for Exercise 1 here

#### Exercise 2: Overall Average Word Count
What is the average word_count for an article across the whole dataset? (Hint: use the .mean() method on the 'word_count' column).

In [None]:
# Your code for Exercise 2 here

#### Exercise 3: Averages by Category (Advanced)
This dataset has a 'category' column. Can you calculate the average word count for each category? This is a very common task in data analysis.

(Hint: The easiest way to do this is with Pandas' .groupby() method: bbc_df.groupby('category')['word_count'].mean())
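
If `.groupby()` is new to you, this minimal sketch shows the idea on a tiny made-up DataFrame (the `toy_df` name and its values are hypothetical). Rows are grouped by the values in one column, and the mean of another column is then computed within each group.

```python
import pandas as pd

# Hypothetical toy data, not taken from the BBC dataset
toy_df = pd.DataFrame({
    "category": ["sport", "tech", "sport", "tech"],
    "word_count": [100, 300, 200, 500],
})

# One mean per category
category_means = toy_df.groupby("category")["word_count"].mean()
print(category_means)
```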

In [None]:
# Your code for Exercise 3 here

## Conclusion
You have learned to take a raw text dataset and immediately start creating meaningful features. You can see how even simple counts can give a better understanding of your data. You have probably also noticed how capitalisation and punctuation will affect our counts. In the next notebook, we will address this directly by cleaning and pre-processing our text to make it standardised and ready for more advanced analysis.