A Python script to check and diagnose character encoding issues in files, with a focus on UTF-8 validation.
- Clone the repository:

  ```bash
  git clone https://github.com/calebdaniel-dev/check-character-encoding.git
  cd check-character-encoding
  ```
- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```
Run the script by providing the path to your file:

```bash
python main.py path/to/your/file.csv
```
The script provides detailed information about your file's encoding:
If your file is properly UTF-8 encoded, you'll see:

```
File Analysis for: path/to/your/file.csv
--------------------------------------------------
✓ The file is UTF-8 encoded
File size: 1234 bytes
```
If your file is not UTF-8 encoded, you'll see detailed diagnostics:

```
File Analysis for: path/to/your/file.csv
--------------------------------------------------
✗ The file is NOT UTF-8 encoded
File size: 1234 bytes

Diagnostics:
Detected encoding: iso-8859-1 (confidence: 92.3%)
UTF-8 decode error: 'utf-8' codec can't decode byte 0xe9 in position 145...

Problematic section:
Position: 145
Hex values: 61 62 63 e9 64 65 66 67 68 69
Printable: abc.defghi
```
The output includes:

- File size in bytes
- Detected encoding and confidence level
- Location and details of the first UTF-8 validation error
- A sample of the problematic section showing:
  - Position: where the error occurred in the file
  - Hex values: the raw byte values around the error
  - Printable: a human-readable rendering of the bytes
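For reference, the core of this kind of check needs only the standard library. The sketch below is a minimal illustration of the general technique, not the script's actual code; the function name and context window are illustrative:

```python
# Minimal sketch of a UTF-8 check with diagnostics (illustrative, not main.py itself)
def check_utf8(path):
    with open(path, 'rb') as f:
        data = f.read()
    try:
        data.decode('utf-8')
        print(f'The file is UTF-8 encoded ({len(data)} bytes)')
    except UnicodeDecodeError as e:
        print(f'UTF-8 decode error: {e}')
        # Show a few bytes of context around the first invalid byte
        window = data[max(0, e.start - 3):e.start + 7]
        print('Position:  ', e.start)
        print('Hex values:', ' '.join(f'{b:02x}' for b in window))
        print('Printable: ', ''.join(chr(b) if 32 <= b < 127 else '.' for b in window))
```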
If you're not sure what to do with this output, try sharing it (along with the file) with ChatGPT and asking it to help you fix the file.
If your file is not UTF-8 encoded, here are two ways to fix it.

In Python, read the file with the detected encoding and re-save it as UTF-8:

```python
# Read with the detected encoding (e.g., 'latin1', 'iso-8859-1', etc.)
with open('your_file.csv', 'r', encoding='detected_encoding') as f:
    content = f.read()

# Save as UTF-8
with open('your_file_utf8.csv', 'w', encoding='utf-8') as f:
    f.write(content)
```

Or use `iconv` on the command line:

```bash
# Replace 'ISO-8859-1' with the detected encoding
iconv -f ISO-8859-1 -t UTF-8 input.csv > output_utf8.csv
```
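If you don't know the source encoding, a detection library can guess it for you. The sketch below uses the chardet package; this is one common choice rather than necessarily what the script itself uses, and the filenames are placeholders (install with `pip install chardet`):

```python
import chardet

with open('your_file.csv', 'rb') as f:
    raw = f.read()

guess = chardet.detect(raw)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.92, ...}
print(f"Detected {guess['encoding']} (confidence: {guess['confidence']:.1%})")

# Decode with the guessed encoding, then re-save as UTF-8
# (guess['encoding'] can be None for binary data, so check it in real use)
text = raw.decode(guess['encoding'])
with open('your_file_utf8.csv', 'w', encoding='utf-8') as f:
    f.write(text)
```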
Common encoding issues:

- Latin-1/ISO-8859-1: If you see bytes like `e9` representing characters like 'é', your file is likely in Latin-1 encoding.
- Windows-1252: Similar to Latin-1, but with additional printable characters in the 128-159 byte range.
- UTF-8 with BOM: If the script detects a BOM (Byte Order Mark), it will be shown in the output. This is generally fine, but some systems may need the BOM removed (see the sketch below).
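To remove a UTF-8 BOM, one approach is to read the file with Python's `utf-8-sig` codec, which strips a leading BOM if present, and write the text back with plain `utf-8`, which adds none. A sketch (filenames are placeholders):

```python
import codecs

with open('your_file.csv', 'rb') as f:
    raw = f.read()

if raw.startswith(codecs.BOM_UTF8):
    print('UTF-8 BOM detected')

# 'utf-8-sig' strips a leading BOM on read; plain 'utf-8' writes none
text = raw.decode('utf-8-sig')
with open('your_file_no_bom.csv', 'w', encoding='utf-8') as f:
    f.write(text)
```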
Some best practices:

- Always make a backup of your files before converting encodings.
- Verify the converted file works correctly in your application.
- When creating new files, explicitly specify UTF-8 encoding to prevent issues.
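On the last point: pass `encoding` explicitly whenever you open a file for writing, since the platform default is not UTF-8 everywhere (notably on older Windows). For example:

```python
# Explicit encoding behaves the same on every platform
with open('new_file.csv', 'w', encoding='utf-8') as f:
    f.write('name,city\nRenée,Zürich\n')
```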
If you're getting unexpected results:
- Check if your file has a BOM (Byte Order Mark)
- Verify the detected encoding confidence level
- Try opening the file in a text editor that shows encoding (like Notepad++ or VS Code)
- For very large files, check if the encoding is consistent throughout the file
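On the last point: for files too large to read into memory at once, you can validate UTF-8 incrementally with the standard library's incremental decoder. A sketch (the chunk size is arbitrary, and the reported byte position is approximate near chunk boundaries):

```python
import codecs

def check_utf8_streaming(path, chunk_size=1 << 20):
    """Validate a file as UTF-8 chunk by chunk instead of loading it whole."""
    decoder = codecs.getincrementaldecoder('utf-8')()
    offset = 0
    with open(path, 'rb') as f:
        while chunk := f.read(chunk_size):
            try:
                decoder.decode(chunk)
            except UnicodeDecodeError as e:
                # e.start is relative to this chunk (plus any buffered bytes)
                print(f'Invalid UTF-8 near byte {offset + e.start}')
                return False
            offset += len(chunk)
    try:
        decoder.decode(b'', final=True)  # fails if the file ends mid-character
    except UnicodeDecodeError:
        print('File ends in the middle of a multi-byte sequence')
        return False
    return True
```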