Comparing changes

  • 14 commits
  • 3 files changed
  • 0 commit comments
  • 1 contributor
Showing with 129 additions and 56 deletions.
  1. +11 −8 README.md
  2. +92 −21 detect-asciiart.php
  3. +26 −27 download.php
README.md
@@ -1,38 +1,41 @@
Using cURL and PHP to download lots and lots of files
=====================================================
-And then searching them for asciiart
-------------------------------------
+And then searching them for ASCII art
+-------------------------------------
Why?
----
-*Because*. OK?
+*Because*.
How?
----
+All the scripts are intended to be run from the command line. Aside from enabling piping and other nifty Unixy functionality, this lets you follow the progress in real time, whereas Apache and your browser would normally deliver the script output in chunks of a few kilobytes each.
+
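For example, progress can be watched and logged at the same time (the log file name here is just an illustration):

    $ php download.php | tee session.log
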
### Prepare
- alexa-top-1m-csv-2-domains.php
+ $ php alexa-top-1m-csv-2-domains.php
The web stats site Alexa publishes a list of the top one million domains. Download it, then use this script to convert the CSV to a file with one domain per line.
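That script isn't part of this diff, so as a rough sketch, the conversion might look something like this (assuming the usual Alexa CSV format of one "rank,domain" pair per line, and the file names top-1m.csv and domains.txt):

    <?php
    // Sketch: turn Alexa's "rank,domain" CSV into one domain per line.
    $in = fopen('top-1m.csv', 'r');
    $out = fopen('domains.txt', 'w');
    while(($row = fgetcsv($in)) !== false){
        // Column 0 is the rank, column 1 the domain name.
        fwrite($out, $row[1]."\n");
    }
    fclose($in);
    fclose($out);
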
### Download a million files
- download.php
+ $ php download.php
A cURL multihandle is executed in a loop. As downloads complete and the respective handles are removed, more URLs are added to the multihandle, keeping the number of downloads in progress constant.
I've run this code reliably for hundreds of thousands of files in a single session.
+It could trivially be made into a more serious tool by reading a URL list from stdin, and using the hash of the URL as the cache file name. Feel free to fork it.
+
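A minimal sketch of that variant, assuming URLs arrive one per line on stdin and the URL's MD5 hash becomes the cache file name (neither detail is code from this repo):

    <?php
    // Sketch: read URLs from stdin, skip anything already cached,
    // and store each page under the hash of its URL.
    while(($line = fgets(STDIN)) !== false){
        $url = trim($line);
        if($url === '')
            continue;
        $cachedFilePath = 'cache/'.md5($url);
        if(is_file($cachedFilePath))
            continue; // Already downloaded.
        $ch = curl_init($url);
        curl_setopt_array($ch, array(
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_TIMEOUT => 60,
        ));
        $pageContent = curl_exec($ch);
        curl_close($ch);
        if($pageContent !== false)
            file_put_contents($cachedFilePath, gzdeflate($pageContent));
    }

Something like `$ cat urls.txt | php download-stdin.php` would then drive it, with download-stdin.php being a hypothetical file name.
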
### Detecting ASCII art
- detect-asciiart.php
+ $ php detect-asciiart.php
-I found only one other algorithm to do it, and it wasn't very good. I just use a few simple heurustics, whith a blacklist beeing the most important. I want to make as few assumpions as possible about what constitutes asciiart. Hence, I don't try to look for more examples of what I have altready seen, but just filter out what I can be sure of is *not* art. This means a more sophisticated approach like a Bayesian fiter is not useful in this context.
+I found only one other algorithm to do it, and it wasn't very good. I just use a few simple heuristics, with a blacklist being the most important. I want to make as few assumptions as possible about what constitutes ASCII art. Hence, I don't try to look for more examples of what I have already seen, but just filter out what I can be sure of is *not* art. This means a more sophisticated approach like a Bayesian filter is not useful in this context.
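The filter only ever sees the first HTML comment of each page. extractFirstComment() itself doesn't appear in this diff; a minimal sketch of such a helper, assuming it simply returns the body of the first <!-- ... --> block, could be:

    <?php
    // Sketch: return the contents of the first HTML comment, or false if none exists.
    function extractFirstComment($pageContent){
        if(preg_match('/<!--(.*?)-->/s', $pageContent, $matches))
            return $matches[1];
        return false;
    }
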
Your code looks like crap
-------------------------
It does, doesn't it? You are welcome to fork it!
-
detect-asciiart.php
@@ -1,20 +1,22 @@
<?php
$cacheDirName = 'cache';
+$numFilesProcessed = 0;
+
+/*
// Loop over all sub directories of "cache".
if($outerDirHandle = opendir($cacheDirName)) {
while(false !== ($innerDirName = readdir($outerDirHandle))){
if($innerDirName != '.' && $innerDirName != '..' && is_dir($cacheDirName.'/'.$innerDirName)){
-
// In each subdirectory, loop over all files (pages).
if($innerDirHandle = opendir($cacheDirName.'/'.$innerDirName)) {
while(false !== ($fileName = readdir($innerDirHandle))){
if($fileName != '.' && $fileName != '..' && !is_dir($cacheDirName.'/'.$innerDirName.'/'.$fileName)){
// Process the page.
- print($fileName.'<br>');
+ ++$numFilesProcessed;
handlePage($fileName, gzinflate(file_get_contents($cacheDirName.'/'.$innerDirName.'/'.$fileName)));
}
}
@@ -24,10 +26,53 @@
}
closedir($outerDirHandle);
}
-print($innerDirName);
+
+
+*/
+
+
+$inFile = @fopen('domains.txt', 'r');
+$outFile = @fopen('ascii_art.txt', 'a');
+
+if ($inFile && $outFile) {
+
+ // Read line by line until fgets() fails at the end of the file.
+ while(($line = fgets($inFile)) !== false){
+
+ // Extract the domain from the line.
+ $domain = trim($line);
+
+ $innerDirName = substr($domain, 0, min(2, strpos($domain, '.')));
+ $pageContent = gzinflate(file_get_contents($cacheDirName.'/'.$innerDirName.'/'.$domain));
+
+ // We only bother to look at the first comment of the page.
+ $firstComment = extractFirstComment($pageContent);
+
+ if($firstComment && isAsciiArt($firstComment)){
+
+ // Log the comment and the domain name to a file.
+ $domainAndArt = "\n\n\n".$domain."\n\n".$firstComment;
+
+ fwrite($outFile, $domainAndArt);
+
+ print("\n".$domainAndArt);
+ print("\n\n".'processed '.$numFilesProcessed.' files');
+ }
+
+ ++$numFilesProcessed;
+ }
+
+ fclose($outFile);
+ fclose($inFile);
+
+}
+
+print('done');
function handlePage($domain, $pageContent){
+
+ global $numFilesProcessed;
// We only bother to look at the first comment of the page.
$firstComment = extractFirstComment($pageContent);
@@ -37,7 +82,8 @@ function handlePage($domain, $pageContent){
// Log the comment and the domain name to a file.
$domainAndArt = "\n\n\n".$domain."\n\n".$firstComment;
file_put_contents('ascii_art.txt', $domainAndArt, FILE_APPEND);
- print('<pre>'.htmlspecialchars($domainAndArt).'</pre>');
+ print("\n".$domainAndArt);
+ print("\n\n".'processed '.$numFilesProcessed.' files');
}
}
@@ -78,10 +124,20 @@ function isAsciiArt($comment){
if($numChars > 2000)
return false;
+ $numLines = count(explode("\n", $comment));
+
+ // Must be at least 5 lines.
+ if($numLines < 5)
+ return false;
+
+ // Must be less than a page.
+ if($numLines > 40)
+ return false;
+
// Try to block commented-out HTML. Roughly sorted by frequency of occurrence as anecdotally spotted in the wild.
foreach(array(
- // Head element stuff.
+ // Head element stuff
'<style',
'<meta',
'<![endif]',
@@ -90,7 +146,7 @@ function isAsciiArt($comment){
'[if lte IE ',
'[if gte IE',
- // Some end-tags.
+ // Some end-tags
'</td>',
'</tr>',
'</script>',
@@ -100,31 +156,46 @@ function isAsciiArt($comment){
'</li>',
'</ul>',
'</object>',
+ '</p>',
+ '</form>',
+ '</body>',
- // Some CMS signatures.
+ // Some CMS signatures
'TYPO3',
'START DEBUG OUTPUT',
'generated',
+ 'XT-Commerce',
+ 'TYPOlight',
+ 'Contao Open Source CMS',
+ 'W3 Total Cache',
+ 'Free CSS Templates',
+ 'DYNAMIC PAGE-SPECIFIC META TAGS WILL BE PLACED HERE',
+ 'vBulletin',
+ 'phpBB',
+ 'Shopsoftware by Gambio',
+ 'BLOX CMS',
+ 'Shopsystem powered by',
+ 'phpwcms',
+ // Misc. garbage
'<rdf:RDF',
'src="',
+ 'Exception]:',
+ 'DoubleClick',
+ 'ct=WEBSITE',
+ 'Unfortunately, Microsoft has added a clever new',
+ 'skype.com/go/skypebuttons',
+ 'eXTReMe Non Public Tracker Code',
+ 'CURRENCY SETTING:',
+ 'This page is valid XHTML 1.0 Transitional',
+ 'Be inspired, but please don\'t steal...',
+ 'This credit note should not be removed',
+ 'chCounter',
+ 'These paths are pathed fo veiwing by a browser',
+ 'MyFonts Webfont Build',
) as $codeFragment)
if(strpos($comment, $codeFragment) !== false)
return false;
-
- $numLines = count(explode("\n", $comment));
-
- // Must be at least 5 lines.
- if($numLines < 5)
- return false;
-
- // Must be less than a page.
- if($numLines > 40)
- return false;
-
- // Must have more than 3 consecutive of the same symbol.
- if(!preg_match('/(.)\1{3}/', $comment))
- return false;
return true;
}
download.php
@@ -1,6 +1,6 @@
<?php
-$poolSize = 100;
+$poolSize = 20;
$numRequestsInPool = 0;
$numFinishedFiles = 0;
$mh = curl_multi_init();
@@ -10,22 +10,17 @@
while(1){
// Refill pool while it's not full and there are lines left in the file.
- while($numRequestsInPool < $poolSize && ($domain = fgets($fileHandle)) !== false){
+ while($numRequestsInPool < $poolSize && ($line = fgets($fileHandle)) !== false){
+
+ $domain = trim($line);
// Check if it's in the cache.
$cachedFilePath = cachedFilePath($domain);
if(is_file($cachedFilePath)){
-// print('<br>in cache: '.$domain);
-//
-// // Process immediately.
-// handlePage($domain, gzinflate(file_get_contents($cachedFilePath)));
-
}else{
// Initiate download.
-// print('<br>added: '.$domain);
-
// Set up curl to download the frontpage.
$URL = 'http://www.'.$domain;
$ch = curl_init($URL);
@@ -34,6 +29,7 @@
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_MAXREDIRS => 3,
CURLOPT_TIMEOUT => 60,
+ CURLOPT_CONNECTTIMEOUT => 10,
));
// Remember what domain this handle is downloading.
@@ -43,30 +39,37 @@
curl_multi_add_handle($mh, $ch);
++$numRequestsInPool;
}
-
- flush();
}
-
- // Wait for data.
- curl_multi_select($mh);
- // Process requests.
-// print('<br>processing: ');
+ print("\n\n".'Processing cURL: ');
do {
-// print('*');
- $execReturnValue = curl_multi_exec($mh, $foo);
- } while ($execReturnValue == CURLM_CALL_MULTI_PERFORM);
+ print('*');
+ $mrc = curl_multi_exec($mh, $active);
+ } while ($mrc == CURLM_CALL_MULTI_PERFORM);
+
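+ // While transfers remain active, block in curl_multi_select() until some socket is ready, then let curl_multi_exec() do more work.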
+ if ($active && $mrc == CURLM_OK) {
+
+ print("\n\n".'Waiting for data...');
+ if (curl_multi_select($mh) != -1) {
+
+ print("\n\n".'Processing cURL: ');
+ do {
+ print('*');
+ $mrc = curl_multi_exec($mh, $active);
+ } while ($mrc == CURLM_CALL_MULTI_PERFORM);
+ }
+ }
// Handle finished requests.
- print('<br><br><b>finished:</b>');
+ print("\n\n\n".'Finished:');
while(false !== $handleInfo = curl_multi_info_read($mh)){
// Check if the handle is done.
if($handleInfo['msg'] == CURLMSG_DONE){
$domain = $handleToDomain[$handleInfo['handle']];
- print('<br>'.$domain );
+ print("\n".$domain );
// Read the page from the handle.
$pageContent = curl_multi_getcontent($handleInfo['handle']);
@@ -75,18 +78,14 @@
file_put_contents(cachedFilePath($domain), gzdeflate($pageContent));
++$numFinishedFiles;
-// handlePage($domain , $pageContent);
-
// Remove the handle from the pool.
curl_multi_remove_handle($mh, $handleInfo['handle']);
curl_close($handleInfo['handle']);
unset($handleToDomain[$handleInfo['handle']]);
--$numRequestsInPool;
}
-
- flush();
}
- print('<br><br>downloaded '.$numFinishedFiles.' files this session.');
+ print("\n\n".'Downloaded '.$numFinishedFiles.' files this session.');
// Are we done yet?
if(feof($fileHandle) && !$numRequestsInPool)
@@ -96,7 +95,7 @@
curl_multi_close($mh);
fclose($fileHandle);
- print('<br><br>done');
+ print("\n\n".'done');
}
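
download.php calls a cachedFilePath() helper that this diff doesn't show. Judging from the path logic visible in detect-asciiart.php, a sketch consistent with it would be:

    <?php
    // Sketch: cache/<first two letters of the domain>/<domain>,
    // mirroring the inline lookup in detect-asciiart.php.
    function cachedFilePath($domain){
        $innerDirName = substr($domain, 0, min(2, strpos($domain, '.')));
        return 'cache/'.$innerDirName.'/'.$domain;
    }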
